Systems and Methods for Predicting Treatment-Regimen-Related Outcomes

ABSTRACT

Systems and methods are provided for predicting treatment-regimen-related outcomes (e.g., risks of regimen-related toxicities). A predictive model is determined for predicting treatment-regimen-related outcomes and applied to a plurality of datasets. An ensemble algorithm is applied on result data generated from the application of the predictive model. Treatment-regimen-related outcomes are predicted using the predictive model. A combination of machine learning prediction and patient preference assessment is provided for enabling informed consent and precise treatment decisions.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. application Ser. No. 15/953,544, filed Apr. 16, 2018, entitled “Systems and Methods for Predicting Treatment-Regimen-Related Outcomes,” which is a continuation of U.S. application Ser. No. 15/939,621, filed Mar. 29, 2018, entitled “Systems and Methods for Predicting Treatment-Regimen-Related Outcomes,” which is a continuation of Patent Cooperation Treaty Application No. PCT/US2016/054355, filed Sep. 29, 2016, entitled “Systems and Methods for Predicting Treatment-Regimen-Related Outcomes,” which claims priority to U.S. Provisional Application No. 62/234,763, filed Sep. 30, 2015, entitled “Systems and Methods for Predicting Treatment-Regimen-Related Outcomes,” all of which are incorporated herein by reference in their entireties.

TECHNICAL FIELD

The present disclosure relates generally to the field of risk prediction, and, more specifically, to systems and methods for predicting treatment-regimen-related outcomes using predictive models. The present disclosure describes a combination of machine learning prediction and patient preference assessment to enable informed consent and precise treatment decisions.

BACKGROUND

Cancer, a genetic disease resulting in abnormal proliferation of affected cells, is one of the most common causes of death in many parts of the world. Estimated new cases of cancer in the United States in 2014 were over 1.5 million (excluding nonmelanoma skin cancers), and estimated deaths from cancer were in excess of 500,000.

One cancer treatment option is chemotherapy. Chemotherapy is the use of anticancer drugs to suppress or kill cancerous cells, and is one of the most common treatments for cancer. Tumor cells are characterized by fast growth and reproduction, local invasion and distant spread (metastases). In most cases, chemotherapy works by targeting various cell cycle pathways that are used by the tumor cells to promote their growth and spread. A chemotherapeutic drug may be used alone, in combination with one or more other chemotherapeutic drugs, in combination with other treatments such as radiation or surgery, or in combination with biologic agents, targeted agents, immune-therapies or antibody-based therapies. Certain chemotherapy drugs and their combinations may be administered in a specific order depending on the type of cancer they are being used to treat.

Clinical outcomes, such as efficacy and/or side effects (also known as toxicities) of certain medical treatments such as chemotherapy, are important for evaluating or assessing the effect of the treatment regimens. The prediction of the clinical outcomes plays a critical role for developing precision medical treatments. For example, upon diagnosis of cancer and during the planning of treatment options by the physician, additional patient information, such as genetic information or non-genetic information, may help determine the likelihood of a patient developing regimen-related toxicities. Currently there are no precise methods or systems that allow a physician to predict an individual patient's risk of side-effects or toxicities of anticancer regimens. Having such methods or systems would allow for the adoption of precision medicine for treatment of cancer. Predicting efficacy and potential side effects or toxicities based on genetic or other patient information requires an innovative approach because such risk may be associated with combinations of factors including but not limited to networks of genes functioning and interacting together.

SUMMARY

Systems and methods are provided for predicting regimen-related outcomes (e.g., risks of regimen-related toxicities). A predictive model is determined for predicting regimen-related outcomes and applied to a plurality of datasets. An ensemble algorithm is applied on result data generated from the application of the predictive model. Regimen-related outcomes are predicted using the predictive model.

According to one embodiment, a processor-implemented method is provided for predicting risk of regimen-related toxicities. The method comprises: generating, using one or more data processors, one or more training datasets and one or more testing datasets based at least in part on clinical data or gene feature data of a plurality of patients; determining, using the one or more data processors, one or more initial predictive models using one or more machine learning algorithms based at least in part on the one or more training datasets; applying, using the one or more data processors, the one or more initial predictive models on the one or more training datasets to generate result data; performing, using the one or more data processors, an ensemble algorithm on the result data to generate ensemble data; determining, using the one or more data processors, one or more final predictive models based at least in part on the ensemble data; evaluating, using the one or more data processors, performance of the one or more final predictive models based at least in part on the one or more testing datasets; and predicting, using the one or more data processors, regimen-related outcomes using the one or more final predictive models.

According to another embodiment, a processor-implemented method is provided for building a predictive model for predicting regimen-related outcomes. The method comprises: dividing, using one or more data processors, a training dataset into a plurality of sub-datasets; selecting, using the one or more data processors, one or more first training sub-datasets from the plurality of sub-datasets; determining, using the one or more data processors, a first predictive model using one or more machine learning algorithms based at least in part on the one or more first training sub-datasets; evaluating, using the one or more data processors, the performance of the first predictive model using the plurality of sub-datasets excluding the one or more first training sub-datasets; and determining, using the one or more data processors, a final predictive model based at least in part on the performance evaluation of the first predictive model.
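The divide, select, train, and evaluate steps of this embodiment resemble k-fold cross-validation. The following is a minimal sketch under that assumption, using only the Python standard library. The threshold learner, the function names, and the `min_mean_accuracy` acceptance rule are hypothetical stand-ins for illustration; the disclosure does not specify them.

```python
import random
from statistics import mean

def split_into_folds(rows, k, seed=0):
    """Shuffle the training dataset and divide it into k sub-datasets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def fit_threshold_model(train_rows):
    """Hypothetical stand-in learner: a cutoff halfway between class means."""
    pos = [x for x, y in train_rows if y == 1]
    neg = [x for x, y in train_rows if y == 0]
    return (mean(pos) + mean(neg)) / 2.0

def accuracy(threshold, rows):
    """Fraction of rows whose label matches the thresholded prediction."""
    return mean(1.0 if (x > threshold) == (y == 1) else 0.0 for x, y in rows)

def cross_validate(rows, k=5):
    """Train on k-1 sub-datasets, evaluate on the excluded one, for each fold."""
    folds = split_into_folds(rows, k)
    scores = []
    for held_out in range(k):
        train = [r for i, fold in enumerate(folds) if i != held_out for r in fold]
        scores.append(accuracy(fit_threshold_model(train), folds[held_out]))
    return scores

def build_final_model(rows, k=5, min_mean_accuracy=0.8):
    """Refit on all rows only if the held-out performance is acceptable."""
    scores = cross_validate(rows, k)
    if mean(scores) < min_mean_accuracy:
        raise ValueError("held-out performance too low to accept the model")
    return fit_threshold_model(rows), scores

# Toy data: one numeric feature, two well-separated outcome classes.
rows = [(float(x), 0) for x in range(10)] + [(float(x), 1) for x in range(100, 110)]
final_threshold, fold_scores = build_final_model(rows)
```

Here the final model is simply refit on the full training dataset once the fold scores pass the acceptance rule; other selection strategies (for example, choosing among several candidate algorithms by fold score) would fit the same description.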

According to yet another embodiment, a processor-implemented system is provided for predicting regimen-related outcomes. The system comprises: one or more processors and one or more non-transitory machine-readable storage media. The one or more processors are configured to: generate one or more training datasets and one or more testing datasets based at least in part on clinical data or gene feature data of a plurality of patients; determine one or more initial predictive models using one or more machine learning algorithms based at least in part on the one or more training datasets; apply the one or more initial predictive models on the one or more training datasets to generate result data; perform an ensemble algorithm on the result data to generate ensemble data; determine one or more final predictive models based at least in part on the ensemble data; evaluate performance of the one or more final predictive models based at least in part on the one or more testing datasets; and predict regimen-related outcomes using the one or more final predictive models. The one or more non-transitory machine-readable storage media are provided for storing a computer database having a database schema that includes and interrelates clinical data fields, gene feature data fields, result data fields, ensemble data fields and predictive model data fields. The clinical data fields store the clinical data, the gene feature data fields store the gene feature data, and the result data fields store the result data. The ensemble data fields store the ensemble data, and the predictive model data fields store parameter data of the initial predictive models and the final predictive models.
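One way to picture a schema that interrelates these fields is a small in-memory store keyed by patient and model identifiers. The sketch below is illustrative only: the class names, field names, and the `add_result` check are assumptions, not the schema of the disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class PatientRecord:
    patient_id: str
    clinical: Dict[str, str]        # clinical data fields
    gene_features: Dict[str, int]   # gene feature data fields

@dataclass
class ModelRecord:
    name: str
    stage: str                      # "initial" or "final"
    parameters: Dict[str, float]    # predictive model data fields

@dataclass
class PredictionStore:
    patients: Dict[str, PatientRecord] = field(default_factory=dict)
    models: Dict[str, ModelRecord] = field(default_factory=dict)
    # result data fields: model name -> patient id -> predicted risk
    result_data: Dict[str, Dict[str, float]] = field(default_factory=dict)
    # ensemble data fields: patient id -> combined risk
    ensemble_data: Dict[str, float] = field(default_factory=dict)

    def add_result(self, model_name: str, patient_id: str, risk: float) -> None:
        """Store a prediction only if both referenced records exist."""
        if model_name not in self.models:
            raise KeyError(f"unknown model: {model_name}")
        if patient_id not in self.patients:
            raise KeyError(f"unknown patient: {patient_id}")
        self.result_data.setdefault(model_name, {})[patient_id] = risk

store = PredictionStore()
store.patients["p1"] = PatientRecord("p1", {"stage": "II"}, {"snp_a": 1})
store.models["m1"] = ModelRecord("m1", "initial", {"weight": 0.4})
store.add_result("m1", "p1", 0.72)
```

Rejecting results that reference an unknown model or patient is one simple way to keep the result data tied to the clinical, gene feature, and model records it depends on.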

According to yet another embodiment, a processor-implemented system is provided for building a predictive model for predicting regimen-related outcomes. The system comprises: one or more processors and one or more non-transitory machine-readable storage media. The one or more processors are configured to: divide a training dataset into a plurality of sub-datasets; select one or more first training sub-datasets from the plurality of sub-datasets; determine a first predictive model using one or more machine learning algorithms based at least in part on the one or more first training sub-datasets; evaluate the performance of the first predictive model using the plurality of sub-datasets excluding the one or more first training sub-datasets; and determine a final predictive model based at least in part on the performance evaluation of the first predictive model. The one or more non-transitory machine-readable storage media are provided for storing a computer database having a database schema that includes and interrelates training data fields, first predictive model data fields, and final predictive model data fields. The training data fields store the training dataset, the first predictive model data fields store parameter data of the first predictive model, and the final predictive model data fields store parameter data of the final predictive model.

In some embodiments, a non-transitory computer-readable medium is encoded with instructions for commanding one or more processors to execute operations of a method for predicting regimen-related outcomes. The method comprises: generating one or more training datasets and one or more testing datasets based at least in part on clinical data or gene feature data of a plurality of patients; determining one or more initial predictive models using one or more machine learning algorithms based at least in part on the one or more training datasets; applying the one or more initial predictive models on the one or more training datasets to generate result data; performing an ensemble algorithm on the result data to generate ensemble data; determining one or more final predictive models based at least in part on the ensemble data; evaluating performance of the one or more final predictive models based at least in part on the one or more testing datasets; and predicting regimen-related outcomes using the one or more final predictive models.

In certain embodiments, a non-transitory computer-readable medium is encoded with instructions for commanding one or more processors to execute operations of a method for building a predictive model for predicting regimen-related outcomes. The method comprises: dividing a training dataset into a plurality of sub-datasets; selecting one or more first training sub-datasets from the plurality of sub-datasets; determining a first predictive model using one or more machine learning algorithms based at least in part on the one or more first training sub-datasets; evaluating the performance of the first predictive model using the plurality of sub-datasets excluding the one or more first training sub-datasets; and determining a final predictive model based at least in part on the performance evaluation of the first predictive model.

In other embodiments, a non-transitory computer-readable medium is provided for storing data for access by an application program being executed on a data processing system. The storage medium comprises: a data structure stored in said memory, said data structure including information, resident in a database used by said application program. The data structure includes: one or more clinical data objects stored in said memory, the clinical data objects containing clinical data of a plurality of patients from said database; one or more gene feature data objects stored in said memory, the gene feature data objects containing gene feature data of the plurality of patients from said database; one or more training data objects stored in said memory, the training data objects containing one or more training datasets generated based at least in part on the clinical data or the gene feature data; one or more predictive model data objects stored in said memory; one or more initial predictive model data objects stored in said memory, the initial predictive model data objects containing parameters of one or more initial predictive models determined using one or more machine learning algorithms based at least in part on the one or more training datasets; one or more result data objects stored in said memory, the result data objects containing result data generated by applying the initial predictive models on the one or more training datasets; one or more ensemble data objects stored in said memory, the ensemble data objects containing ensemble data generated by performing an ensemble algorithm on the result data; and one or more final predictive model data objects stored in said memory, the final predictive model data objects containing parameters of one or more final predictive models determined based at least in part on the ensemble data. The final predictive model data objects are used by said application program for predicting regimen-related outcomes.

In other embodiments, a non-transitory computer-readable medium is provided for storing data for access by an application program being executed on a data processing system. The storage medium comprises: a data structure stored in said memory, said data structure including information, resident in a database used by said application program. The data structure includes: one or more training data objects stored in said memory, the training data objects containing a training dataset from said database, the training dataset including a plurality of sub-datasets; one or more first predictive model data objects stored in said memory, the first predictive model data objects containing parameter data of a first predictive model determined using one or more machine learning algorithms based at least in part on one or more first training sub-datasets from the plurality of sub-datasets; and one or more final predictive model data objects stored in said memory, the final predictive model data objects containing parameter data of a final predictive model determined based at least in part on performance evaluation of the first predictive model. The final predictive model data objects are used by said application program for predicting regimen-related outcomes.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 depicts an example computer-implemented environment wherein users can interact with a regimen-related outcome prediction system hosted on one or more servers through a network.

FIG. 2 depicts an example block diagram for building and evaluating predictive models.

FIG. 3 depicts an example block diagram for model building.

FIG. 4 depicts an example diagram for patient preference assessment.

FIG. 5 depicts an example flow chart for building and evaluating predictive models.

FIG. 6 depicts an example flow chart for model building.

FIG. 7 depicts an example diagram showing a system for predicting regimen-related outcomes.

FIG. 8 depicts an example diagram showing a computing system for predicting regimen-related outcomes.

FIG. 9-FIG. 48 depict example diagrams showing model building and evaluation in one embodiment.

DETAILED DESCRIPTION

1. Overview

The present disclosure relates to systems and methods for treating diseases (e.g., cancer) in a subject. For example, cancer encompasses a wide range of conditions, each with a unique disease profile and treatment regimen. After a subject is diagnosed with a certain type of cancer, various chemotherapeutic and anticancer treatment options are considered based on the cancer type and a variety of its genetic makeup and molecular markers. Additional external information about the patient is collected, including, but not limited to, medical history, gender, age, ethnicity, hereditary medical information, genetic information, demographic information, environmental information, and other information related to the individual patient. Such information can be obtained using various methods, including at the point of care through questionnaires, from surveys, or from personal health records. One or more sources of such additional information are used as input for a regimen-related outcome prediction tool to predict regimen-related outcomes, such as risks of regimen-related toxicities. A personalized risk profile can be generated and the optimal course of treatment can be determined. It should be understood that the systems and methods described herein are not limited to any particular disease (such as cancer) or any particular treatment regimen.

2. Types of Cancer

The systems and methods provided herein can be used for treating the side effects of a number of cancer types including Acute Lymphoblastic Leukemia; Acute Myeloid Leukemia; Adrenocortical Carcinoma; Adrenocortical Carcinoma, Childhood; Appendix Cancer; Basal Cell Carcinoma; Bile Duct Cancer, Extrahepatic; Bladder Cancer; Bone Cancer; Osteosarcoma and Malignant Fibrous Histiocytoma; Brain Stem Glioma, Childhood; Brain Tumor, Adult; Brain Tumor, Brain Stem Glioma, Childhood; Brain Tumor, Central Nervous System Atypical Teratoid/Rhabdoid Tumor, Childhood; Central Nervous System Embryonal Tumors; Cerebellar Astrocytoma; Cerebral Astrocytoma/Malignant Glioma; Craniopharyngioma; Ependymoblastoma; Ependymoma; Medulloblastoma; Medulloepithelioma; Pineal Parenchymal Tumors of Intermediate Differentiation; Supratentorial Primitive Neuroectodermal Tumors and Pineoblastoma; Visual Pathway and Hypothalamic Glioma; Brain and Spinal Cord Tumors; Breast Cancer; Bronchial Tumors; Burkitt Lymphoma; Carcinoid Tumor; Carcinoid Tumor, Gastrointestinal; Central Nervous System Atypical Teratoid/Rhabdoid Tumor; Central Nervous System Embryonal Tumors; Central Nervous System Lymphoma; Cerebellar Astrocytoma; Cerebral Astrocytoma/Malignant Glioma, Childhood; Cervical Cancer; Chordoma, Childhood; Chronic Lymphocytic Leukemia; Chronic Myelogenous Leukemia; Chronic Myeloproliferative Disorders; Colon Cancer; Colorectal Cancer; Craniopharyngioma; Cutaneous T-Cell Lymphoma; Esophageal Cancer; Ewing Family of Tumors; Extragonadal Germ Cell Tumor; Extrahepatic Bile Duct Cancer; Eye Cancer, Intraocular Melanoma; Eye Cancer, Retinoblastoma; Gallbladder Cancer; Gastric (Stomach) Cancer; Gastrointestinal Carcinoid Tumor; Gastrointestinal Stromal Tumor (GIST); Germ Cell Tumor, Extracranial; Germ Cell Tumor, Extragonadal; Germ Cell Tumor, Ovarian; Gestational Trophoblastic Tumor; Glioma; Glioma, Childhood Brain Stem; Glioma, Childhood Cerebral Astrocytoma; Glioma, Childhood Visual Pathway and Hypothalamic; Hairy Cell Leukemia; Head and Neck Cancer; Hepatocellular (Liver) Cancer; Histiocytosis, Langerhans Cell; Hodgkin Lymphoma; Hypopharyngeal Cancer; Hypothalamic and Visual Pathway Glioma; Intraocular Melanoma; Islet Cell Tumors; Kidney (Renal Cell) Cancer; Langerhans Cell Histiocytosis; Laryngeal Cancer; Leukemia, Acute Lymphoblastic; Leukemia, Acute Myeloid; Leukemia, Chronic Lymphocytic; Leukemia, Chronic Myelogenous; Leukemia, Hairy Cell; Lip and Oral Cavity Cancer; Liver Cancer; Lung Cancer, Non-Small Cell; Lung Cancer, Small Cell; Lymphoma, AIDS-Related; Lymphoma, Burkitt; Lymphoma, Cutaneous T-Cell; Lymphoma, Hodgkin; Lymphoma, Non-Hodgkin; Lymphoma, Primary Central Nervous System; Macroglobulinemia, Waldenstrom; Malignant Fibrous Histiocytoma of Bone and Osteosarcoma; Medulloblastoma; Melanoma; Melanoma, Intraocular (Eye); Merkel Cell Carcinoma; Mesothelioma; Metastatic Squamous Neck Cancer with Occult Primary; Mouth Cancer; Multiple Endocrine Neoplasia Syndrome, (Childhood); Multiple Myeloma/Plasma Cell Neoplasm; Mycosis Fungoides; Myelodysplastic Syndromes; Myelodysplastic/Myeloproliferative Diseases; Myelogenous Leukemia, Chronic; Myeloid Leukemia, Adult Acute; Myeloid Leukemia, Childhood Acute; Myeloma, Multiple; Myeloproliferative Disorders, Chronic; Nasal Cavity and Paranasal Sinus Cancer; Nasopharyngeal Cancer; Neuroblastoma; Non-Small Cell Lung Cancer; Oral Cancer; Oral Cavity Cancer; Oropharyngeal Cancer; Osteosarcoma and Malignant Fibrous Histiocytoma of Bone; Ovarian Cancer; Ovarian Epithelial Cancer; Ovarian Germ Cell Tumor; Ovarian Low Malignant Potential Tumor; Pancreatic Cancer; Pancreatic Cancer, Islet Cell Tumors; Papillomatosis; Parathyroid Cancer; Penile Cancer; Pharyngeal Cancer; Pheochromocytoma; Pineal Parenchymal Tumors of Intermediate Differentiation; Pineoblastoma and Supratentorial Primitive Neuroectodermal Tumors; Pituitary Tumor; Plasma Cell Neoplasm/Multiple Myeloma; Pleuropulmonary Blastoma; Primary Central Nervous System Lymphoma; Prostate Cancer; Rectal Cancer; Renal Cell (Kidney) Cancer; Renal Pelvis and Ureter, Transitional Cell Cancer; Respiratory Tract Carcinoma Involving the NUT Gene on Chromosome 15; Retinoblastoma; Rhabdomyosarcoma; Salivary Gland Cancer; Sarcoma, Ewing Family of Tumors; Sarcoma, Kaposi; Sarcoma, Soft Tissue; Sarcoma, Uterine; Sezary Syndrome; Skin Cancer (Nonmelanoma); Skin Cancer (Melanoma); Skin Carcinoma, Merkel Cell; Small Cell Lung Cancer; Small Intestine Cancer; Soft Tissue Sarcoma; Squamous Cell Carcinoma, Squamous Neck Cancer with Occult Primary, Metastatic; Stomach (Gastric) Cancer; Supratentorial Primitive Neuroectodermal Tumors; T-Cell Lymphoma, Cutaneous; Testicular Cancer; Throat Cancer; Thymoma and Thymic Carcinoma; Thyroid Cancer; Transitional Cell Cancer of the Renal Pelvis and Ureter; Trophoblastic Tumor, Gestational; Urethral Cancer; Uterine Cancer, Endometrial; Uterine Sarcoma; Vaginal Cancer; Vulvar Cancer; Waldenstrom Macroglobulinemia; or Wilms Tumor.

3. Cancer Therapy

Chemotherapy is one of the most widely used treatment methods for cancer. Chemotherapy can be used alone or in combination with surgery or radiation therapy, or in combination with other anti-cancer agents. These other anti-cancer agents, which can be used alone or in combination with other treatments, include, but are not limited to, monoclonal antibodies, biologic agents, targeted agents, immune-therapies or antibody-based therapies.

A number of chemotherapeutic agents are available today. These agents include, but are not limited to, alkylating agents, antimetabolites, anti-tumor antibiotics, topoisomerase inhibitors and mitotic inhibitors.

While chemotherapy can be quite effective in treating certain cancers, chemotherapy drugs reach all parts of the body, not just the cancer cells. Because of this, there may be many side effects during treatment, including tissue damage. For example, oxidative stress, caused directly or indirectly by chemotherapeutics (e.g., doxorubicin), is one of the underlying mechanisms of the toxicity of anticancer drugs in noncancerous tissues, including the heart and brain. In addition, extravasation, i.e., the accidental administration of intravenously (IV) infused chemotherapeutic agents into the tissue surrounding the infusion sites, can cause significant injury.

3.1 Types of Chemotherapy

The systems and methods provided herein can be used to predict regimen-related outcomes, including efficacy and toxicity, that can be used by a physician or a patient to tailor a specific treatment regimen in order to optimize the patient's clinical outcomes. For example, a chemotherapeutic agent can be administered to patients to treat virtually any disorder that is now known or that is later discovered to be treatable with such compounds or combinations thereof. These agents include Alkylating agents, Platinum agents, Anthracyclines, Antimetabolites, Anti-tumor antibiotics, Topoisomerase inhibitors (such as camptothecin compounds), Podophyllotoxin derivatives, antibiotics, anti-tumor antibodies, Taxanes, and Mitotic inhibitors. In particular, chemotherapeutic agents include, but are not limited to, Amsacrine, Actinomycin, All-trans retinoic acid, Azacitidine, Azathioprine, belustine, Bleomycin, Bortezomib, Busulfan, Camptosar™ (irinotecan HCL), Carmustine, Carboplatin, Capecitabine, Cisplatin, Chlorambucil, Chlomaphazin, Cyclophosphamide, Cytarabine, Cytosine arabinoside, Dacarbazine, Dactinomycin, Daunomycin, Daunorubicin, Docetaxel, Doxifluridine, Doxorubicin, Epirubicin, Epothilone, Etoposide, Fluorouracil, Gemcitabine, Hycamtin™ (topotecan HCL), Hydroxyurea, Idarubicin, Imatinib, Irinotecan, Ifosfamide, Mechlorethamine, Melphalan, Mercaptopurine, Methotrexate, Mithramycin, Mitomycin, Mitomycin C, Mitoxantrone, Mitopodozide, Navelbine™ (vinorelbine-5′-noranhydroblastine), Nitrogen mustard, Oxaliplatin, Paclitaxel, Pemetrexed, Procarbazine, Teniposide, Tioguanine, Topotecan, Trimethylene thiophosphoramide, Uracil mustard, Valrubicin, Vinblastine, Vincristine, Vindesine, and Vinorelbine, and other compounds derived from Camptothecin and its analogues.

3.2 Side Effects of Chemotherapy

Chemotherapy can cause a variety of side-effects/toxicities, and it is imperative to reduce the severity or frequency of certain toxicities associated with the exposure to the chemotherapy in the patient to both alleviate suffering and potentially increase the dose, thus increasing the chance of successful therapy. These toxicities include, but are not limited to, neurotoxicity, nephrotoxicity, ototoxicity, cardiotoxicity, alopecia, fatigue, cognitive dysfunction, diarrhea, nausea, vomiting, mucosal toxicities (mucositis of the gastrointestinal tract and vaginitis), xerostomia, infertility, pulmonary toxicity, low white blood cell counts, infection, anemia, low platelet counts with or without bleeding, and renal failure. Some of these toxicities, when severe enough, can lead to hospitalizations, medical care in an intensive care unit and sometimes death. In specific embodiments, the side effects whose severity or frequency may be predicted by the systems and methods provided herein include: chemotherapy-induced peripheral neuropathy (CIPN) (including damages to certain nerves, which may impair sensation, movement, gland or organ function, or other aspects of health, depending on the type of nerves affected), chemotherapy-induced nausea and vomiting (CINV), fatigue, oral mucositis, diarrhea and cognitive dysfunction.

The side effect profiles of chemotherapeutic drugs vary considerably in terms of short- and long-term side effects. Short-term side effects include mostly the toxic effects encountered during or shortly after a course of chemotherapy. Long-term side effects include later complications arising after the conclusion of the course of chemotherapy and may last for many months or years, or be permanent. The side effect profiles vary by type of drug, dosage and treatment regimen, but there is also considerable variability in side effect profile across patient populations and more specifically individual patients. It is therefore highly desirable to be able to predict the outcomes of a treatment regimen in terms of both short- and long-term side effects for each individual patient to enable the patient and the physician to make the appropriate choice given the individual patient's circumstances.

4. Obtaining Patient Information

The techniques described herein can be used with all types of patient information or information from healthy individuals (e.g., to generate control groups), including, but not limited to, medical history, gender, age, ethnicity, hereditary medical information, genetic information, demographic information, environmental information, and other information related to the individual patient. Such information can be obtained using various methods, including at the point of care through questionnaires, from surveys, or from personal health records.

Genetic information is generated from genetic material that can be collected from patients or healthy individuals in various ways. In one embodiment, the material is a sample of any tissue or bodily fluid, including hair, blood, tissue obtained through biopsy, or saliva. The material can be collected at point of care or at home. When collected at home, the patient or healthy individual may be sent a collection kit accompanied by instructions for collecting the sample and a questionnaire. In addition to the genetic sample collection kit, the patient or healthy individual may be sent a unique identifier which is to be used to link the information provided in response to the questionnaire with the genetic material. DNA can be extracted using techniques known in the art.

A number of techniques can be used to obtain genetic information from material samples. These techniques include, but are not limited to, SNP-arrays to detect SNPs, DNA microarrays to determine gene expression profiles, tiling arrays to analyze larger genomic regions of interest (e.g., a human chromosome), and Nanopore sequencing for cost-effective high-throughput sequencing of the entire genome. It should be understood that the systems and methods described herein may be configured to obtain other types of patient information or information from healthy individuals (e.g., to generate control groups) as input data, including but not limited to, proteomic information, transcriptomic information, and metabolomic information.

5. Predictive Modeling

FIG. 1 depicts an example computer-implemented environment wherein users 102 (e.g., health care providers) can interact with a regimen-related outcome prediction system 104 hosted on one or more servers 106 through a network 108. The regimen-related outcome prediction system 104 can assist the users 102 to build and/or evaluate one or more predictive models for predicting regimen-related outcomes (e.g., risk of regimen-related toxicities) for treating diseases in a subject. In specific embodiments, the regimen-related outcome prediction system 104 is configured to combine machine learning prediction based on the one or more predictive models and patient preference assessment to enable informed consent and precise treatment decisions.

In some embodiments, the regimen-related outcome prediction system 104 assists the users 102 to obtain genetic information or non-genetic information of certain patients (e.g., by any means known to a skilled artisan) for creating analysis datasets and building one or more predictive models for predicting outcomes of specific treatment regimens. Data handling of genetic information or non-genetic information will be described in detail with reference to FIG. 2. In certain embodiments, the regimen-related outcome prediction system 104 implements the one or more predictive models to predict outcomes of a specific treatment regimen using genetic or non-genetic information from an individual patient and determine suitability of the treatment regimen for the patient.

In specific embodiments, the regimen-related outcome prediction system 104 builds the one or more predictive models using one or more deterministic models/algorithms. For example, the regimen-related outcome prediction system 104 implements one or more machine learning algorithms, such as penalized logistic regression, random forests, and/or C5.0, for building the one or more predictive models. The model building process will be described in detail with reference to FIG. 3. It should be understood that other known machine learning algorithms may also be implemented for building the one or more predictive models.
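As an illustration of one of the named algorithms, the sketch below fits a penalized logistic regression by batch gradient descent using only the Python standard library. It assumes an L2 (ridge) penalty; the disclosure does not state which penalty is used, and the function names, learning rate, and toy data are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_risk(weights, bias, x):
    """Predicted probability of the outcome for one feature vector."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def fit_penalized_logistic(X, y, l2=0.1, lr=0.1, epochs=500):
    """Minimize mean log-loss plus (l2 / 2) * ||w||^2; the bias is unpenalized."""
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_risk(weights, bias, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        for j in range(d):
            # Data gradient plus the shrinkage term from the L2 penalty.
            weights[j] -= lr * (grad_w[j] / n + l2 * weights[j])
        bias -= lr * grad_b / n
    return weights, bias

# Toy binary predictors: the first column tracks the outcome, the second does not.
X = [[0, 1], [0, 0], [1, 1], [1, 0]]
y = [0, 0, 1, 1]
weights, bias = fit_penalized_logistic(X, y)
```

The penalty shrinks coefficients toward zero, which is one common way to limit overfitting when the number of candidate predictors (for example, gene features) is large relative to the number of patients.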

In some embodiments, the regimen-related outcome prediction system 104 may assist one or more of the users 102 to build and/or evaluate one or more predictive models through a graphical user interface 116. For example, the users 102 (or the system operator) may provide inputs at the graphical user interface 116 for the regimen-related outcome prediction system 104 to build the one or more predictive models. In certain embodiments, the user inputs may include non-genetic information related to individual patients, such as medical history, gender, age, ethnicity, demographic information, and environmental information. For example, the user inputs may include patient preferences for disease treatments or toxicities. In some embodiments, the regimen-related outcome prediction system 104 may assist one or more of the users 102 to predict regimen-related outcomes using the one or more predictive models through the graphical user interface 116. For example, the regimen-related outcome prediction system 104 may output a personalized risk profile related to a particular disease for an individual patient and the optimal course of treatment of the disease on the graphical user interface 116.

As shown in FIG. 1, the users 102 can interact with the regimen-related outcome prediction system 104 in a number of ways, such as over one or more networks 108. One or more servers 106 accessible through the network(s) 108 can host the regimen-related outcome prediction system 104. The one or more servers 106 can also contain or have access to the one or more data stores 110 for storing data for the regimen-related outcome prediction system 104, or receive input data (e.g., genetic information or non-genetic information) from external sources. It should be appreciated that in alternative embodiments the server 106 may be self-contained and not connected to external networks due to security or other concerns.

FIG. 2 depicts an example block diagram for building and evaluating predictive models. As shown in FIG. 2, non-genetic data 202 and/or genetic data 204 are obtained to generate one or more training datasets 206 and one or more testing datasets 208. One or more models 210 are built using the one or more training datasets 206. Prediction result data 212 of the one or more models 210 based at least in part on the one or more training datasets 206 is given as input to an ensemble algorithm 214 for generating ensemble data 216 as a final set of predictions. One or more final predictive models 242 are determined based at least in part on the ensemble data 216. The one or more final predictive models 242 are applied to the one or more testing datasets 208 for performance evaluation to generate evaluation results 240. The one or more final predictive models 242 can be applied to new patient data of an individual patient for prediction of regimen-related outcomes.

Specifically, a data handling process is performed (e.g., by the regimen-related outcome prediction system 104) to obtain the non-genetic data 202 and/or the genetic data 204. The non-genetic data 202 may include certain clinical data 218 of individual patients which can be used to generate the one or more training datasets 206 and the one or more testing datasets 208. For example, the clinical data 218 includes diagnosis data, cancer-stage data, regimen related data, and neuropathy related data related to individual patients. Binary predictors may be generated by splitting parameters (e.g., diagnosis factors, cancer-stage factors, regimen factors) related to the clinical data 218. In some embodiments, a stratified random approach is used to generate the one or more training datasets 206 and the one or more testing datasets 208 with a numeric regimen categorization and response category so that regimens for affected subjects are proportionally represented in the training datasets and the testing datasets.

In some embodiments, the genetic data 204 includes gene feature data 220, such as data related to one or more SNPs, which can be used to generate the one or more training datasets 206 and the one or more testing datasets 208. In some embodiments, original gene feature data 220 may include a large number of SNPs, and certain pre-processing steps and/or a data filtering process may be performed to determine a limited number of filtered SNPs 224 from the large number of SNPs to simplify and improve the subsequent predictive modeling. For example, the pre-processing steps may include removing certain SNPs due to too much missing data. Also, highly associated SNPs (e.g., contingency table agreement ≥0.7) may be removed (e.g., one of each pair of associated SNPs may be removed). It should be understood that other known pre-processing steps may also be performed to ensure data quality.
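The removal of highly associated SNPs can be illustrated with a minimal sketch. It assumes that "contingency table agreement" means the fraction of subjects whose genotype labels match between two SNPs, and it uses a greedy pass that keeps the first SNP of each associated pair. The function names, the SNP identifiers, and the genotype labels are hypothetical and are not taken from the disclosure.

```python
def agreement(calls_a, calls_b):
    """Fraction of subjects with identical genotype labels (assumed agreement measure)."""
    return sum(a == b for a, b in zip(calls_a, calls_b)) / len(calls_a)


def drop_associated_snps(genotypes, threshold=0.7):
    """Greedy filter: keep a SNP only if it is not highly associated with one already kept.

    genotypes: dict mapping a SNP id to a list of genotype labels, one per subject.
    Insertion order decides which SNP of an associated pair counts as "first".
    """
    kept = []
    for snp, calls in genotypes.items():
        if all(agreement(calls, genotypes[k]) < threshold for k in kept):
            kept.append(snp)
    return kept


# Hypothetical toy data: snp_b matches snp_a for 9 of 10 subjects, snp_c for none.
snps = {
    "snp_a": ["A", "A", "H", "B", "B", "A", "H", "B", "A", "A"],
    "snp_b": ["A", "A", "H", "B", "B", "A", "H", "B", "A", "H"],
    "snp_c": ["B", "H", "A", "A", "H", "B", "B", "A", "H", "B"],
}
print(drop_associated_snps(snps))
```

With millions of SNPs an all-pairs comparison would be too slow, so a practical pipeline would likely restrict comparisons (for example to nearby SNPs); the sketch only shows the keep-first, drop-second rule.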

After the pre-processing steps, the filtering process is performed through recursive partitioning models (e.g., using 10-fold cross-validation) to determine the filtered SNPs 224. Recursive partitioning creates a decision tree that strives to correctly classify members of a dataset by splitting it into sub-datasets based on several dichotomous independent variables. Each sub-dataset may in turn be split an indefinite number of times until the splitting process terminates after a particular stopping criterion is reached. For example, the pre-processed SNP dataset is divided into ten sub-datasets. Nine out of the ten sub-datasets are selected for recursive partitioning modeling which involves building a classification tree. The size of the classification tree is selected by developing recursive partitioning models for each potential tree size on the selected nine sub-datasets. The one sub-dataset other than the selected nine sub-datasets is used to determine the size of the tree that yields a maximum predictive accuracy. Such a process is repeated on each of the possible arrangements of the ten sub-datasets. Any SNPs that are used in any of the recursive partitioning models are kept for predictive modeling. Upon completion of the pre-processing steps and the filtering process, the filtered SNPs 224 are determined for predictive modeling. In specific embodiments, the gene feature data 220 may include one or more selected SNPs 222 (e.g., as identified in Jean E. Abraham et al., Clinical Cancer Research 20(9), May 1, 2014; McWhinney-Glass et al., Clinical Cancer Research 19(20), Oct. 15, 2013; Won et al., Cancer, 118:2828-36, 2012).

As shown in FIG. 2, the one or more training datasets 206 may be generated using the non-genetic data 202 (e.g., the clinical data 218) and/or the genetic data 204 (e.g., the selected SNPs 222, the filtered SNPs 224). The one or more models 210 can be built (e.g., by the regimen-related outcome prediction system 104) using the one or more training datasets 206. In the process of building the models 210, imbalance in the response may influence models to predict samples into the majority class. To adjust for imbalance, the regimen-related outcome prediction system 104 may perform both up-sampling (e.g., selecting additional minority class subjects with replacement to increase the minority class size) and down-sampling (e.g., sampling the majority class to create balance with the minority class). In some embodiments, down-sampling yields better predictive models than up-sampling.
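The two balancing strategies can be sketched as follows. This is an illustrative sketch only: the helper names are hypothetical, sampling is done on row indices, and the class counts (110 affected, 42 unaffected) are borrowed from the training set of the exemplary embodiment purely as sample input.

```python
import random


def split_by_class(labels):
    """Map each class label to the list of row indices carrying it."""
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    return by_class


def down_sample(labels, rng):
    """Sample every class without replacement down to the minority class size."""
    by_class = split_by_class(labels)
    n_min = min(len(members) for members in by_class.values())
    chosen = []
    for members in by_class.values():
        chosen.extend(rng.sample(members, n_min))
    return sorted(chosen)


def up_sample(labels, rng):
    """Keep every row and add minority rows, drawn with replacement, up to the majority size."""
    by_class = split_by_class(labels)
    n_max = max(len(members) for members in by_class.values())
    chosen = []
    for members in by_class.values():
        chosen.extend(members)
        chosen.extend(rng.choices(members, k=n_max - len(members)))
    return sorted(chosen)


labels = ["affected"] * 110 + ["unaffected"] * 42
rng = random.Random(0)  # fixed seed so the sketch is reproducible
down = down_sample(labels, rng)
up = up_sample(labels, rng)
print(len(down), len(up))  # 84 rows after down-sampling, 220 after up-sampling
```

Down-sampling discards majority-class rows, while up-sampling repeats minority-class rows; which one works better is an empirical question for a given dataset.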

For example, as shown in FIG. 3, the one or more training datasets 206 are divided into a plurality of sub-datasets 302 for exploring different models/algorithms (e.g., machine learning models/algorithms). One or more training sub-datasets 304 are selected from the plurality of sub-datasets 302 for building a model 308. The performance of the model 308 is evaluated using other sub-datasets 306 in the training datasets 206. Such a process is repeated multiple times, and each time a different group of sub-datasets is selected from the training datasets 206 for building and evaluating a model, until a set of models (e.g., the one or more models 210) is determined. In certain embodiments, five repeats of 10-fold cross-validation are used on the training datasets 206 to determine the optimal tuning parameter setting for the models 210.
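The resampling scheme of five repeats of 10-fold cross-validation can be sketched as an index generator. This is a minimal sketch under stated assumptions: the function name and seed are hypothetical, the folds are not stratified, and the sample count of 152 is used only as example input.

```python
import random


def repeated_kfold(n_samples, n_folds=10, n_repeats=5, seed=0):
    """Yield (train_indices, held_out_indices) pairs for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh random partition for every repeat
        folds = [order[i::n_folds] for i in range(n_folds)]
        for held_out in range(n_folds):
            train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
            yield train, folds[held_out]


splits = list(repeated_kfold(152))
print(len(splits))  # 5 repeats x 10 folds = 50 build/evaluate rounds
```

For each candidate tuning parameter setting, a model would be built on every `train` part and scored on the matching held-out part; the setting with the best average score across the 50 rounds would then be retained.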

One or more machine learning algorithms, including but not limited to, penalized logistic regression, random forests, and/or C5.0, can be applied (e.g., by the regimen-related outcome prediction system 104) on the one or more training datasets 206 for predictive model building (e.g., as shown in FIG. 3). In some embodiments, the penalized logistic regression algorithm can be implemented to find parameter estimates that maximize the objective function (e.g., log-likelihood), subject to regularization constraints. One regularization constraint forces the parameter estimates to be smaller (e.g., shrinkage), while the other regularization constraint essentially forces some parameter estimates to zero (e.g., lasso). The penalized logistic regression algorithm is suited for problems where the predictors are highly correlated or when there are more predictors than subjects. Because the regularization forces some parameter estimates to zero, a predictive model generated based on the penalized logistic regression algorithm performs internal variable selection.
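One common way to write an objective with both a shrinkage penalty and a lasso penalty is the elastic-net form, shown in the sketch below. The disclosure does not state the exact penalty used, so the elastic-net parameterization, the function name, and the parameters `lam` (overall penalty strength) and `alpha` (mix between the two penalties) are assumptions made for illustration.

```python
import math


def penalized_log_likelihood(beta, intercept, X, y, lam, alpha):
    """Bernoulli log-likelihood minus an elastic-net penalty (to be maximized).

    alpha = 0 gives a pure shrinkage (ridge) penalty, alpha = 1 a pure lasso penalty.
    The intercept is left unpenalized.
    """
    loglik = 0.0
    for row, label in zip(X, y):
        z = intercept + sum(b * x for b, x in zip(beta, row))
        # log-likelihood of a 0/1 outcome under a logit link: y*z - log(1 + e^z)
        loglik += label * z - math.log1p(math.exp(z))
    ridge = 0.5 * sum(b * b for b in beta)   # shrinks estimates toward zero
    lasso = sum(abs(b) for b in beta)        # can force estimates exactly to zero
    return loglik - lam * ((1 - alpha) * ridge + alpha * lasso)


# Hypothetical toy data: two subjects, two binary predictors.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1, 0]
print(penalized_log_likelihood([1.0, -1.0], 0.0, X, y, lam=0.5, alpha=0.5))
```

In such a formulation, `lam` and `alpha` would be the tuning parameters chosen by cross-validation.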

In certain embodiments, the random forests (RF) algorithm is a tree-based method built on an ensemble of trees. A predictive model generated based on the RF algorithm repeats the following process many times: a bootstrap sample of the training dataset is selected and a tree is built on the bootstrap sample. Within each tree, a randomly selected subset of predictors is chosen for each split, and the optimal split is selected only from that subset. One or more tuning parameters for a predictive model generated based on the RF algorithm include the number of randomly selected predictors for each split. Building an ensemble of trees in this way reduces the variance seen by using just a single tree.
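The two sources of randomness in the RF procedure can be sketched in isolation. The sketch below does not grow trees; it only shows how a bootstrap sample and a per-split predictor subset might be drawn. The function names and the parameter name `mtry` are illustrative assumptions.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample per tree)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def candidate_predictors(n_predictors, mtry, rng):
    """Draw mtry distinct predictor indices to consider at a single split."""
    return rng.sample(range(n_predictors), mtry)


rng = random.Random(0)
rows = bootstrap_indices(152, rng)           # rows used to grow one tree
columns = candidate_predictors(14, 4, rng)   # predictors examined at one split
print(len(rows), len(set(rows)), columns)
```

Because rows are drawn with replacement, a bootstrap sample typically omits some rows and repeats others, which is what makes the individual trees differ from one another.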

In specific embodiments, the C5.0 algorithm can be used to generate a predictive model. Specifically, the C5.0 algorithm is implemented (e.g., by the regimen-related outcome prediction system 104) to build a sequence of trees. At each step in the sequence, the regimen-related outcome prediction system 104 adjusts each sample's weight based on the accuracy of the model at each iteration. Samples that are predicted correctly are given less weight, while samples that are predicted incorrectly are given more weight. The final model prediction is a combination of the predictions across all trees. It should be understood that the systems and methods disclosed herein are not limited to penalized logistic regression, random forests, and C5.0 that are merely described above as examples. Other machine learning algorithms may be implemented for predictive modeling (e.g., as shown in FIG. 3).
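The reweighting idea can be illustrated with an AdaBoost-style update, used here as a stand-in. This is not the C5.0 implementation, whose exact weighting scheme is not given in this description; the function name and the example weights are hypothetical.

```python
import math


def reweight(weights, correct):
    """One AdaBoost-style step: shrink weights of correctly predicted samples, grow the rest.

    weights: current sample weights; correct: True where the current tree was right.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok) / sum(weights)
    if not 0.0 < error < 0.5:
        raise ValueError("update is defined only for a weighted error strictly between 0 and 0.5")
    step = 0.5 * math.log((1.0 - error) / error)
    raw = [w * math.exp(-step if ok else step) for w, ok in zip(weights, correct)]
    total = sum(raw)
    return [w / total for w in raw]  # renormalize so the weights sum to 1


new_weights = reweight([0.25, 0.25, 0.25, 0.25], [True, True, True, False])
print(new_weights)
```

After this update the single misclassified sample carries half of the total weight, so the next tree in the sequence is pushed to predict it correctly.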

In some embodiments, the ensemble algorithm 214 is trained to combine the prediction result data 212 optimally to generate the ensemble data 216. For example, a weight may be determined for the prediction result of each of the models 210, and a weighted sum of the prediction results of all the models 210 is calculated to generate the ensemble data 216.

In certain embodiments, the ensemble algorithm 214 involves an average calculation of the result data 212 generated by applying the models 210 on the training datasets 206. In some embodiments, the ensemble algorithm 214 uses a logistic regression model to combine the result data 212 across models. It should be understood that the ensemble algorithm 214 disclosed herein is not limited to the average calculation and the logistic regression model. The ensemble algorithm 214 may include a stacking technique, a blending technique, or any other known second-level machine learning algorithm in which predictions of a collection of models are combined to form a final set of predictions.
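The averaging and weighted-sum variants of the ensemble step can be sketched as follows. The predicted probabilities, the model order, and the weights are hypothetical values chosen for illustration; in a trained ensemble the weights would be learned rather than fixed by hand.

```python
def average_ensemble(predictions):
    """Average the per-subject predictions of several models."""
    n_models = len(predictions)
    return [sum(column) / n_models for column in zip(*predictions)]


def weighted_ensemble(predictions, weights):
    """Weighted sum of the per-subject predictions, one weight per model."""
    return [sum(w * p for w, p in zip(weights, column)) for column in zip(*predictions)]


# One row per model, one column per subject (hypothetical predicted probabilities).
result_data = [
    [0.9, 0.2, 0.6],
    [0.7, 0.4, 0.5],
    [0.8, 0.3, 0.7],
]
print(average_ensemble(result_data))
print(weighted_ensemble(result_data, [0.5, 0.25, 0.25]))
```

A logistic regression ensemble would go one step further and fit the weights (plus an intercept) to the observed outcomes, using the model predictions as its inputs.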

In specific embodiments, the training datasets 206 include a clinical predictor dataset, a selected SNP dataset, and a filtered SNP dataset. Correspondingly, the result data 212 includes clinical predictor result data, selected SNP result data, and filtered SNP result data. The ensemble algorithm 214 can be applied to a combination of these result data. For example, the ensemble algorithm 214 is applied to a combination of the clinical predictor result data and the selected SNP result data, or a combination of the clinical predictor result data, the selected SNP result data, and the filtered SNP result data.

As shown in FIG. 2, the final predictive models 242 are applied to the testing datasets 208 to generate the evaluation results 240. As an example, the evaluation results 240 include sensitivity or specificity parameters for the one or more final predictive models 242. In certain embodiments, the one or more final predictive models 242 provided herein can be used to predict the occurrence of a side effect (such as those listed in Section 3.2) during the treatment of a cancer (such as those listed in Section 2) with a therapy (such as those listed in Section 3.1) with an accuracy of at least 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or at least 99%.
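Sensitivity, specificity, and accuracy can be computed from test-set predictions as in the following sketch. The function name, the class labels, and the six example subjects are hypothetical.

```python
def evaluate(actual, predicted, positive="affected"):
    """Return sensitivity, specificity, and accuracy for a binary classifier."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    return {
        "sensitivity": tp / (tp + fn),  # share of affected subjects correctly flagged
        "specificity": tn / (tn + fp),  # share of unaffected subjects correctly cleared
        "accuracy": (tp + tn) / len(pairs),
    }


actual = ["affected", "affected", "affected", "affected", "unaffected", "unaffected"]
predicted = ["affected", "affected", "affected", "unaffected", "unaffected", "affected"]
results = evaluate(actual, predicted)
print(results)
```

When the classes are imbalanced, accuracy alone can look favorable for a model that mostly predicts the majority class, which is why sensitivity and specificity are reported alongside it.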

In specific embodiments, the systems and methods disclosed herein (e.g., the regimen-related outcome prediction system 104) are configured to combine model prediction and patient preferences for generating individualized patient reports so that treatment options tailored for individual patients can be determined. For example, patients diagnosed with cancer face important decisions, in partnership with their physicians, regarding chemotherapy options. The patients may weigh the potential clinical benefit against the potential toxicities of the available therapies and their likely effects on quality of life. Patient preferences for cancer therapy incorporate a patient's understanding of the relative benefit and harm of the various alternatives. Understanding a patient's preferences can better inform clinical decision-making.

As an example, three treatment options may be presented to a patient with breast cancer: 1. dose-dense doxorubicin/cyclophosphamide (AC) for four cycles, followed by dose-dense paclitaxel (T) for a first number of weeks with granulocyte-colony stimulating factor (G-CSF) support; 2. dose-dense AC for the first number of weeks for four cycles, followed by paclitaxel (T) weekly for twelve weeks; 3. docetaxel/cyclophosphamide (TC) for a second number of weeks for six cycles. The regimen-related outcome prediction system 104 determines a personalized genomic risk profile related to these three treatment regimens for the patient, e.g., as shown in Table 0. Particularly, each number shown in a particular cell of Table 0 refers to a percentage risk for a corresponding side-effect/toxicity.

TABLE 0

Chemotherapy                  CINV   Oral mucositis   Diarrhea   Cognitive dysfunction   Fatigue   Peripheral neuropathy
1. Dose-dense AC + T          >90    <10              >90        <10                     70        <10
2. Dose-dense AC + weekly T   >90    <10              >90        <10                     70        <10
3. TC                         50     >90              >90        30                      70        <10

As shown in Table 0, the patient has a high risk of CINV for the first treatment option and the second treatment option and a high risk of oral mucositis for the third treatment option. For all three treatment options, the patient has a high risk of diarrhea, a moderate-high risk of fatigue, and a low risk of cognitive dysfunction and peripheral neuropathy.

The informed consent for the treatment regimens may be obtained from the patient based on the risk profile. For example, the patient may be informed of lowering of white blood cells, red blood cells, and platelets (CBC) and associated risks. In addition, the patient may be informed of the risk of CINV, diarrhea, dehydration, electrolyte imbalance, organ damage, fatigue, hair loss, infusion reactions, allergic reactions, etc. Also, the patient may be informed of the risk of cardiac dysfunction due to doxorubicin, bleeding in the bladder due to cyclophosphamide, and other side effects that can be severe and cause death.

FIG. 4 depicts an example diagram for patient preference assessment. As shown in FIG. 4, a visual analog scale is designed for quantifying the patient's willingness to tolerate side effects. For example, on the analog scale, a score of 100 is set for perfect health, and a score of 0 is set for death. Patient preferences with respect to different side effects can be quantified on the analog scale.

Specifically, as shown in FIG. 4, current health of the patient, for example, is ranked at 82 out of 100. Preference assessment shows that the patient is least willing to tolerate peripheral neuropathy (ranked at 10 out of 100) and fatigue (ranked at 20 out of 100).

A treatment regimen can be selected based on the combination of the personalized risk profile and the patient preference assessment. For example, for this particular patient with breast cancer, the second treatment option, dose-dense AC and weekly paclitaxel, may be selected as the preferred therapy. This allows a clinician to plan best supportive care, for example, including: palonosetron and dexamethasone prior to chemotherapy, two additional days of dexamethasone after chemotherapy for prevention of CINV, daily IV hydration and loperamide for prevention of diarrhea. In addition, the nursing staff may provide teaching to monitor the patient's temperature daily.

In some embodiments, chemotherapy regimens or agents may be switched to avoid side effects yet maintain the anti-cancer effect of a therapeutically equivalent regimen. Further, supportive care agents to prevent and/or ameliorate side effects may be planned accordingly.

In certain embodiments, the regimen-related outcome prediction system 104 is configured to implement one or more methods for assessing patient preferences. For example, several methods of assessing patient preferences in oncology treatment can be used: 1) standard gamble (SG), 2) time trade-off (TTO), 3) ranking or rating scale, and 4) visual analogue scale (VAS). In some embodiments, these methods are combined in different manners to effectively assess patient preferences.

The SG method is a quantitative assessment of patient preferences based on modern (or expected) utility theory, and is a method of decision-making under uncertainty that incorporates the decision maker's fundamental preferences in the decision process. Utility in this context refers to the desirability or preference of the individual for a given outcome expressed in a cardinal number; utility methods enable the decision-maker to reach a rational decision that is consistent with his/her personal preferences. Use of utility methods is based on health-related quality of life conditions (e.g., as described in Torrance et al., Journal of chronic diseases, 40.6, 1987). The SG technique may be implemented to measure utilities and used in clinical situations to help individual patients reach healthcare-related decisions (e.g., as described in Torrance et al., Journal of chronic diseases, 40.6, 1987). Because the individual's choices are made under uncertainty, this technique most closely resembles the uncertainty of the clinical situation and is considered to be the ‘gold standard’ of preference assessment tools.

In the SG method, patients choose either a gamble between perfect health (for a set time) and immediate death or a certainty of living in an intermediate health state (between perfect health and death) for a set time. Perfect health has a probability of P and death has a probability of 1-P. The value of P is varied until the patient is indifferent to the choice between the gamble and the certain intermediate health state, at which point P is the utility for the certain intermediate health state (e.g., as described in Blinman et al., Ann. Oncol., 23: 1104-1110, 2012). The treatment with highest expected utility may be the preferred treatment.

The TTO method was developed as an alternative for SG, specifically for use in healthcare settings (e.g., as described in Torrance et al., Journal of chronic diseases, 40.6, 1987). The TTO does not involve probabilities and is easier for patients to use. It involves trade-offs between two alternative health states, although patient decisions are made under conditions of certainty, which lessens its similarity to clinical realities. Patients choose either an intermediate health state for a given time (t) or perfect health for less than that given time (x<t) followed by immediate death. The duration (x) is modified until the patient is indifferent between the two alternatives.
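The SG and TTO indifference points can be turned into utilities on a 0 (death) to 1 (perfect health) scale, as in the sketch below. The SG rule follows the description above (the utility equals the indifference probability P). The TTO rule, utility equal to x divided by t, is the conventional scoring for states preferred to death and is stated here as an assumption, since it is not spelled out above. All function names and numeric inputs are hypothetical.

```python
def standard_gamble_utility(p_indifference):
    """SG: the utility of the intermediate state equals the indifference probability P."""
    if not 0.0 <= p_indifference <= 1.0:
        raise ValueError("P must lie between 0 and 1")
    return p_indifference


def time_trade_off_utility(x, t):
    """TTO (assumed conventional scoring): x years of perfect health judged equal to t years in the state."""
    if not 0.0 <= x <= t or t <= 0.0:
        raise ValueError("require 0 <= x <= t and t > 0")
    return x / t


def expected_utility(outcomes):
    """Probability-weighted utility over (probability, utility) pairs for one treatment."""
    return sum(probability * utility for probability, utility in outcomes)


# Hypothetical patient: indifferent at P = 0.8 in the gamble, and between
# 6 years of perfect health and 10 years in the intermediate state.
print(standard_gamble_utility(0.8))
print(time_trade_off_utility(6.0, 10.0))
print(expected_utility([(0.7, 1.0), (0.3, 0.6)]))
```

Comparing `expected_utility` across candidate treatments gives one way to identify the treatment with the highest expected utility for a given patient.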

The TTO technique has been validated against SG for assessment of health states ‘preferred to death’ and found to give similar results (as described in Torrance et al., Journal of chronic diseases, 40.6, 1987). The systems described herein may implement the TTO technique to determine preferences of treatment in breast, ovarian, colon, and lung cancers (e.g., as described in Sun et al., Oncology, 87: 118-128, 2002; Simes et al., Psycho-Oncology, 15: 1001-1013, 2001; Duric et al., Br. J. Cancer, 93(12): 1319-1323, 2005; Blinman et al., Eur. J. Cancer, 46(10): 1800-1807, 2010; Blinman et al., Lung Cancer, 72(2): 213-218, 2011).

The rating scale method is a quick and easy assessment tool in which patients are asked to rate a set of available options (i.e., chemotherapies; side effects) on a Likert scale or other scales, with the most preferred health state at one end and the least preferred at the other end. The rating scale method provides ordinal data about patient preferences (e.g., as described in Blinman et al., Ann Oncol., 23: 1104-1110, 2012). Rating scale results may require a power curve correction for optimal reliability (e.g., as described in Torrance et al., Journal of chronic diseases, 40.6, 1987).

The VAS method is simple for patients and caregivers to use, and patients may be asked to choose their health preference on a visual linear rating scale, with the scale anchored on a line by perfect health and death. The VAS ratings are made under conditions of certainty, have no trade-offs, and contain some measurement biases. The VAS may be better used in combination with other methods. The systems described herein may implement the VAS to determine patient preferences for cancer chemotherapy for ovarian cancer (e.g., as described in Sun et al., Oncology, 87: 118-128, 2002 and Sun et al., Support Care Cancer, 13: 219-227, 2005) and for cervical cancer (e.g., as described in Sun et al., Int. J. Gynecol. Cancer, 24(6): 1077-1084, 2014).

The choice of method(s) for assessing patient preferences depends on the clinical situation, goal of the assessment, and degree of acceptable patient burden, with some of the methods being more cognitively demanding than others. Multiple methods may be, and often are, combined to assess preferences most effectively. Determining patient preference is key in facilitating the most appropriate therapeutic decisions.

A number of clinical and demographic factors may influence a patient's preferences for cancer treatments. Incorporating the patient's perspective and value judgments into decisions about their treatment, for example, the trade-offs between the relative preference for improved survival and potential side effects, can help guide therapeutic regimens that are most effective and well-tolerated for the patient. Patient preferences may change depending on the situation. Whether a patient has already been treated or has already made a treatment decision can affect the outcomes of the assessment, so results should be considered in the context of the patient's situation (e.g., as described in Stiggelbout et al., J. Clin. Oncol., 19(1): 220-230, 2001).

In some embodiments, the systems and methods disclosed herein are configured to combine patient preference data, collected using well validated tools, with other patient results such as genomic analyses, in a shared medical decision model with the patient to obtain the most favorable treatment results.

FIG. 5 depicts an example flow chart for building and evaluating predictive models. As shown in FIG. 5, at 402, one or more training datasets and one or more testing datasets are generated based at least in part on clinical data or gene feature data of a plurality of patients. For example, the clinical data includes diagnosis data, cancer-stage data, regimen related data, and neuropathy related data related to individual patients. The gene feature data may include one or more predetermined SNPs and/or one or more filtered SNPs which are generated through certain pre-processing steps and a filtering process. At 404, one or more initial predictive models are determined using one or more machine learning algorithms based at least in part on the one or more training datasets (e.g., as shown in FIG. 6). For example, the one or more machine learning algorithms correspond to one or more of the following: penalized logistic regression, random forests, and C5.0.

At 406, the one or more initial predictive models are applied on the one or more training datasets to generate result data. For example, the training datasets include a clinical predictor dataset, a selected SNP dataset, and a filtered SNP dataset. At 408, an ensemble algorithm is performed on the result data to generate ensemble data. For example, the ensemble algorithm corresponds to an average calculation or a logistic regression model. The ensemble algorithm may be applied to a combination of clinical predictor result data and selected SNP result data, or a combination of the clinical predictor result data, the selected SNP result data, and filtered SNP result data. At 410, one or more final predictive models are determined based at least in part on the ensemble data. At 412, performance of the one or more final predictive models is evaluated based at least in part on the one or more testing datasets. At 414, regimen-related outcomes are predicted using the one or more final predictive models.

FIG. 6 depicts an example flow chart for model building. As shown in FIG. 6, at 502, a training dataset is divided into a plurality of sub-datasets. At 504, one or more first training sub-datasets are selected from the plurality of sub-datasets. At 506, a first model is determined using one or more machine learning algorithms based at least in part on the one or more first training sub-datasets. At 508, the performance of the first model is evaluated using the plurality of sub-datasets excluding the one or more first training sub-datasets. Such a process is repeated multiple times, and each time a different group of sub-datasets is selected from the training dataset for model building and evaluation. At 510, a predictive model is determined based at least in part on the performance evaluation of the first model and other models generated through the different iterations based on different groups of sub-datasets selected from the training dataset.

FIG. 7 depicts an example diagram showing a system for predicting regimen-related outcomes. As shown in FIG. 7, the system 10 includes a computing system 12 that contains a processor 14, a storage device 16 and a regimen-related outcome prediction module 18. The computing system 12 includes any suitable type of computing device (e.g., a server, a desktop, a laptop, a tablet, a mobile phone, etc.) that includes the processor 14 or provides access to a processor via a network or as part of a cloud-based application. The regimen-related outcome prediction module 18 includes tasks (e.g., corresponding to steps shown in FIG. 5) and is implemented as part of a user interface module (not shown in FIG. 7).

FIG. 8 depicts an example diagram showing a computing system for predicting regimen-related outcomes. As shown in FIG. 8, the computing system 12 includes a processor 14, memory devices 1902 and 1904, one or more input/output devices 1906, one or more networking components 1908, and a system bus 1910. In some embodiments, the computing system 12 includes the regimen-related outcome prediction module 18, and provides access to the regimen-related outcome prediction module 18 to a user as a stand-alone computer.

6. Exemplary Embodiment

This embodiment is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. For example, an endpoint for evaluation in this embodiment is binary severity classification of chemotherapy induced peripheral neuropathy (CIPN). The objectives for the endpoint are to:

-   1. Build and evaluate predictive models based on a biased predictor selection approach; and
-   2. Build and evaluate predictive models based on an unbiased predictor selection approach.

It should be understood that the systems and methods described herein can be configured to adopt other endpoints and related objectives for model building/evaluation and outcome prediction.

As an example, predictive models are built and evaluated based on an unbiased predictor selection approach for the endpoint (CIPN), which comprises the following processes: data handling, descriptive analyses, and predictive modeling of classification outcomes.

6.1 Data Handling

A clinical dataset is created through one or more of the following steps:

-   1. importing patient level covariates and the CIPN endpoint;
-   2. splitting a diagnosis factor into individual binary predictors for modeling (e.g. breast=1 for breast cancer, 0 otherwise, etc.);
-   3. splitting a stage factor into individual binary predictors for modeling (e.g. stage 1=1 for stage 1 patients, 0 otherwise, etc.);
-   4. splitting a regimen factor into individual binary predictors for modeling (e.g. CIPNreg 1=1 if CIPNreg n=1, 0 otherwise, etc.);
-   5. keeping subjects with quality values of 0 (e.g., data of decent quality with no major problems) or 1 (high neuropathy scores immediately prior to the first cycle of treatment);
-   6. using CIPN score data (e.g., maximum neuropathy score during the first 9 cycles) as the endpoint and categorizing the data into unaffected (e.g., <4) or affected (e.g., ≥4). If CIPN score data is missing, then it is imputed only if the patient exceeded the criterion in the previous time period; and
-   7. splitting data into a training set (75%) and test set (25%) using a stratified random approach using a numeric regimen categorization and response category in order to ensure that regimens for affected subjects are proportionally represented in training and testing sets.
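Steps 2 to 4, 6, and 7 above can be sketched as follows. This is a minimal sketch under stated assumptions: the helper names, the record fields, the column-name prefix, and the toy records are hypothetical, and the strata are formed from the pair (regimen, response category).

```python
import random
from collections import defaultdict


def one_hot(value, levels, prefix):
    """Split one factor into individual 0/1 predictors, one per level."""
    return {f"{prefix}{level}": int(value == level) for level in levels}


def cipn_category(max_score, threshold=4):
    """Categorize the maximum neuropathy score as affected (>= threshold) or unaffected."""
    return "affected" if max_score >= threshold else "unaffected"


def stratified_split(records, key, train_fraction=0.75, seed=0):
    """Split records into training and test sets separately within every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    train, test = [], []
    for members in strata.values():
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


# Hypothetical toy data: two regimens, 16 subjects each, half of them affected.
records = [{"regimen": r, "score": s} for r in (1, 2) for s in (0, 2, 5, 7) * 4]
train, test = stratified_split(
    records, key=lambda rec: (rec["regimen"], cipn_category(rec["score"]))
)
print(one_hot("breast", ["breast", "colorectal"], "dx_"))
print(len(train), len(test))  # 24 training and 8 test records
```

Splitting inside each stratum is what keeps every regimen-by-response combination represented in roughly the same proportion in both sets; very small strata (for example, a regimen with a single subject) cannot be split proportionally and end up entirely on one side.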

The original SNP data contains approximately 2.3 million unique SNPs. The following pre-processing steps and the data filtering process are taken prior to predictive modeling:

-   1. The following ACORNNO are removed due to too much missing data across SNPs: 38, 102, 211, and 320.
-   2. Highly associated SNPs (contingency table agreement ≥0.7) are removed. If a pair of SNPs have high association, then the first SNP is kept and the second is removed. The number of SNPs after this process is approximately 620 K.
-   3. Less than 0.1% of the data is missing (e.g., labeled as “U”). Because the percentage of missing values is very small, the values are imputed to label “H” in order to prevent computational errors in the model training process.
-   4. Using only the subjects in the training data, recursive partitioning models using 10-fold cross-validation are trained. Any SNP that is used in any of the recursive partitioning models is kept. This process identifies approximately 4300 SNPs as relevant to the response.

The resulting SNPs are referred to as “Filtered SNPs” in the predictive modeling below (e.g., Section 6.3). The training set has 152 samples and the test set has 48 samples. The distribution of affected/unaffected subjects by training/test set is presented in Table 1.

TABLE 1

            Affected   Unaffected
Test        35         13
Training    110        42

6.2 Descriptive Analysis

FIG. 9 depicts an example diagram showing the distribution of age for the entire data set, and within the training and test splits. The average age for all data and within the training and test sets is 56, 55.8, and 56.6, respectively. Overall, the distribution of age is similar across the sets, indicating no bias in age in the selection process between training and test sets.

Tables 2 and 3 provide the counts and percents for the remaining demographic variables and regimens for all data and within the training and test sets. The denominators for computing the percents in these tables are 200 (All), 152 (Train), and 48 (Test). Similar to age, the percents of patients in the training and test sets are similar across the demographic variables and regimens, indicating no bias in the randomization process. Table 4 provides the counts and percents of subjects who were affected within each regimen for all data and within the training and test sets. The denominators for computing the percents in this table are likewise 200 (All), 152 (Train), and 48 (Test).

TABLE 2

             All (n)   All (%)   Training (n)   Training (%)   Test (n)   Test (%)
male         40        20.0      30             19.7           10         20.8
diabetes     36        18.0      28             18.4           8          16.7
breast       95        47.5      73             48.0           22         45.8
colorectal   58        29.0      45             29.6           13         27.1
muscle       16        8.0       11             7.2            5          10.4
ovarian      26        13.0      20             13.2           6          12.5
prostate     5         2.5       3              2.0            2          4.2
Stage1       33        16.5      24             15.8           9          18.8
Stage2       61        30.5      49             32.2           12         25.0
Stage3       60        30.0      48             31.6           12         25.0
Stage4       40        20.0      26             17.1           14         29.2
Stage5       6         3.0       5              3.3            1          2.1

TABLE 3

Regimen                            All (n)   All (%)   Training (n)   Training (%)   Test (n)   Test (%)
Alimta +/− Avastin                 1         0.5       1              0.7            0          0.0
Carbo/Alimta +/− Avastin           3         1.5       2              1.3            1          2.1
Carbo/Docetaxel +/− Herceptin      12        6.0       10             6.6            2          4.2
Carbo/Gem                          1         0.5       1              0.7            0          0.0
Carbo/Taxel +/− Biologic           30        15.0      22             14.5           8          16.7
Cisplt + Alimta                    2         1.0       2              1.3            0          0.0
Docetaxel + Cytoxan +/− Herceptin  16        8.0       12             7.9            4          8.3
Docetaxel + Pred q21               5         2.5       3              2.0            2          4.2
Dose Dense AC + Taxol              68        34.0      51             33.6           17         35.4
FOLFIRI +/− Avastin                4         2.0       4              2.6            0          0.0
FOLFOX4 +/− Avastin                1         0.5       1              0.7            0          0.0
FOLFOX6                            12        6.0       10             6.6            2          4.2
MFOLFOX6 +/− Avastin               41        20.5      30             19.7           11         22.9
TC/Tykerb                          1         0.5       1              0.7            0          0.0
TCH +/− Tykerb                     1         0.5       1              0.7            0          0.0
Wkly Taxol/Carbo +/− Avastin       2         1.0       1              0.7            1          2.1

TABLE 4

Regimen                             All (n)  All (%)  Training (n)  Training (%)  Test (n)  Test (%)
Alimta +/− Avastin                  1        6.5      1             0.7           0         0.0
Carbo/Docetaxel +/− Herceptin       10       5.0      8             5.3           2         4.2
Carbo/Gem                           1        0.5      1             0.7           0         0.0
Carbo/Taxel +/− Biologic            23       11.5     18            11.8          5         10.4
Cisplt + Alimta                     2        1.0      2             1.3           0         0.0
Docetaxel + Cytoxan +/− Herceptin   15       7.5      11            7.2           4         8.3
Docetaxel + Pred q21                1        0.5      1             0.7           0         0.0
Dose Dense AC + Taxol               49       24.5     36            23.7          13        27.1
FOLFIRI +/− Avastin                 2        1.0      2             1.3           0         0.0
FOLFOX4 +/− Avastin                 1        0.5      1             0.7           0         0.0
FOLFOX6                             11       5.5      9             5.9           2         4.2
MFOLFOX6 +/− Avastin                27       13.5     19            12.5          8         16.7
TC/Tykerb                           1        0.5      1             0.7           0         0.0
Wkly Taxol/Carbo +/− Avastin        1        0.5      0             0.0           1         2.1

6.3 Predictive Modeling of Classification Outcomes

The predictive ability of three distinct predictor sets is investigated:

-   1. Clinical predictors
-   2. Selected SNPs
-   3. Filtered SNPs

For example, the selected SNPs are determined as SNPs identified in three manuscripts (i.e., Jean E. Abraham et al., Clinical Cancer Research 20(9), May 1, 2014; McWhinney-Glass et al., Clinical Cancer Research 19(20), Oct. 15, 2013; Won et al., Cancer, 118:2828-36, 2012) as being related to the endpoint of interest. In total, 24 SNPs were identified in these manuscripts. 14 of these 24 SNPs are used for the analysis herein. Many predictive models are explored for this analysis. In the process of building models, imbalance in the response biases models toward placing samples into the majority class. To adjust for imbalance, both up-sampling (selecting additional minority class subjects with replacement to increase the minority class size) and down-sampling (sampling the majority class to create balance with the minority class) are explored. In the data of this embodiment, down-sampling yielded better models than up-sampling. This is likely due to the small training set size.
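The two rebalancing strategies described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names `down_sample` and `up_sample`, the fixed random seed, and the toy subject list are assumptions made for the example and do not reproduce the data or software of this embodiment.

```python
import random

def _group_by_class(rows, label_of):
    """Collect rows into a dict keyed by class label."""
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    return by_class

def down_sample(rows, label_of, seed=0):
    """Sample every class down to the minority class size, without replacement."""
    rng = random.Random(seed)
    by_class = _group_by_class(rows, label_of)
    target = min(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(rng.sample(members, target))
    rng.shuffle(balanced)
    return balanced

def up_sample(rows, label_of, seed=0):
    """Grow every class to the majority class size by drawing with replacement."""
    rng = random.Random(seed)
    by_class = _group_by_class(rows, label_of)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        balanced.extend(members + extra)
    rng.shuffle(balanced)
    return balanced

# Hypothetical toy training set: 9 affected and 3 unaffected subjects.
toy = [("subject%d" % i, "Affected") for i in range(9)]
toy += [("subject%d" % i, "Unaffected") for i in range(9, 12)]

down = down_sample(toy, label_of=lambda row: row[1])
up = up_sample(toy, label_of=lambda row: row[1])
print(len(down), len(up))
```

Down-sampling discards majority class subjects, whereas up-sampling repeats minority class subjects; with a small training set, the repeated subjects can dominate a resampled fold, which is one plausible reason the two strategies behave differently.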

For all models, 5 repeats of 10-fold cross-validation are used on the training set to determine the optimal tuning parameter setting. For the down-sampled data, many models are explored. The three models to be described in detail herein are penalized logistic regression, random forests (RF), and C5.0. A brief explanation of each method has been provided in Section 5.
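The resampling scheme of 5 repeats of 10-fold cross-validation can be sketched as below. This is an illustrative, unstratified version written with the Python standard library; the names `repeated_kfold`, `pick_best`, and `toy_score` are hypothetical, and the toy scoring function merely stands in for fitting a model on the training indices and scoring it on the hold-out indices.

```python
import random

def repeated_kfold(n_samples, n_folds=10, n_repeats=5, seed=0):
    """Yield (repeat, fold, train_idx, holdout_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        # Deal the shuffled indices into folds whose sizes differ by at most one.
        folds = [order[k::n_folds] for k in range(n_folds)]
        for k, holdout in enumerate(folds):
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield repeat, k, train, holdout

def pick_best(candidates, score_fn, n_samples):
    """Return the candidate setting with the highest mean hold-out score."""
    best, best_score = None, float("-inf")
    for candidate in candidates:
        scores = [score_fn(candidate, train, holdout)
                  for _, _, train, holdout in repeated_kfold(n_samples)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best, best_score = candidate, mean_score
    return best, best_score

def toy_score(candidate, train, holdout):
    """Hypothetical score that peaks at a setting of 3; stands in for a real metric."""
    return -abs(candidate - 3)

# 152 is the training set size stated in the text; the candidates are illustrative.
best, best_score = pick_best([1, 2, 3, 4], toy_score, n_samples=152)
print(best, best_score)
```

Each candidate tuning value is scored on 50 hold-out folds (5 repeats times 10 folds), and the value with the best average is retained.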

The results from each of these models are obtained for the following subsets of data: clinical predictors, selected SNPs, and filtered SNPs. To give equal weight to each of these predictor subsets, a simple ensemble and a model-based ensemble are constructed for the following combinations: clinical predictors and selected SNPs; and clinical predictors, selected SNPs, and filtered SNPs. The simple ensemble approach takes the average of the model predictions, while the model-based approach uses a logistic regression model to combine the predictions across models.

The three modeling techniques have similar predictive performance, but C5.0 is computationally more efficient in some circumstances. For example, C5.0 performs better than the other two models on the selected SNP subset, while RF performs slightly better than the other two models on the filtered SNP subset. The penalized logistic regression model performs better than the other two models on the model-based ensemble across the clinical, selected SNPs, and filtered SNPs predictors.

6.3.1 C5.0

6.3.1.1 Clinical Predictors

FIG. 10 depicts an example diagram showing the tuning parameter profile for the C5.0 model related to clinical predictors. The optimal number of trials (iterations) for this data is 3. The distribution of affected/unaffected subjects based on the clinical predictors is presented in Table 5, where Y-axis indicates predicted affected/unaffected subjects and X-axis indicates observed affected/unaffected subjects. FIG. 11 depicts an example diagram showing a receiver-operating-characteristic (ROC) curve related to the clinical predictors. FIG. 12 depicts an example diagram showing top important predictors among the clinical predictors.

TABLE 5

Predicted     Observed Affected  Observed Unaffected
Affected      22                 2
Unaffected    13                 11
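A predicted-versus-observed table of this kind can be summarized with standard classification rates. The following Python sketch applies them to the Table 5 counts; treating "Affected" as the positive class and the function name `summarize` are assumptions made for illustration, and these summary rates are not reported in the source.

```python
def summarize(tp, fp, fn, tn):
    """Compute accuracy, sensitivity, and specificity from 2x2 counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,      # share of all subjects classified correctly
        "sensitivity": tp / (tp + fn),      # share of observed affected predicted affected
        "specificity": tn / (tn + fp),      # share of observed unaffected predicted unaffected
    }

# Table 5 counts, with rows as predictions and columns as observations:
# predicted affected: 22 observed affected (tp), 2 observed unaffected (fp)
# predicted unaffected: 13 observed affected (fn), 11 observed unaffected (tn)
table5 = summarize(tp=22, fp=2, fn=13, tn=11)
for name, value in table5.items():
    print(name, round(value, 3))
```

The same function can be applied to the other predicted-versus-observed tables in this section to compare models on a common footing.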

6.3.1.2 Selected SNPs

FIG. 13 depicts an example diagram showing the tuning parameter profile for the C5.0 model related to the selected SNPs. The optimal number of trials (iterations) for this data is 2. The distribution of affected/unaffected subjects based on the selected SNPs is presented in Table 6, where Y-axis indicates predicted affected/unaffected subjects and X-axis indicates observed affected/unaffected subjects. FIG. 14 depicts an example diagram showing a ROC curve related to the selected SNPs. FIG. 15 depicts an example diagram showing the top important predictors among the selected SNPs.

TABLE 6

Predicted     Observed Affected  Observed Unaffected
Affected      21                 6
Unaffected    14                 7
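The ROC curves referenced in this section plot the true positive rate against the false positive rate as the classification threshold varies. A minimal Python sketch of that construction is given below; the function names `roc_points` and `auc` and the six toy scores are hypothetical and are not taken from the figures or tables of this disclosure.

```python
def roc_points(observed, scores, positive="Affected"):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    n_pos = sum(1 for o in observed if o == positive)
    n_neg = len(observed) - n_pos
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; each step admits more predicted positives.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for o, s in zip(observed, scores) if s >= threshold and o == positive)
        fp = sum(1 for o, s in zip(observed, scores) if s >= threshold and o != positive)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical predicted probabilities of being affected, with observed outcomes.
observed = ["Affected", "Affected", "Unaffected", "Affected", "Unaffected", "Unaffected"]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(observed, scores)
area = auc(curve)
print(curve)
print(area)
```

An area of 0.5 corresponds to chance-level ranking and an area of 1.0 to perfect separation of affected from unaffected subjects.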

6.3.1.3 Simple Ensemble Approach: Clinical Predictions and Selected SNPs Predictions

A simple ensemble approach is applied to determine a simple average of test set predicted probabilities to classify subjects into the affected and unaffected categories. FIG. 16 depicts an example diagram showing a ROC curve based on the average of clinical predictions and selected SNPs predictions. The distribution of affected/unaffected subjects based on the simple ensemble approach is presented in Table 7, where Y-axis indicates predicted affected/unaffected subjects and X-axis indicates observed affected/unaffected subjects. Table 8 includes prediction data of the simple ensemble approach from calculation based on an average of the model probabilities.

TABLE 7

Predicted     Observed Affected  Observed Unaffected
Affected      27                 3
Unaffected    8                  10

TABLE 8

Observed    Clinical  Selected  Ensemble  Ensemble Pred
Affected    0.32      0.27      0.29      Unaffected
Affected    1.00      0.12      0.56      Affected
Affected    1.00      0.75      0.88      Affected
Affected    0.33      0.79      0.56      Affected
Affected    0.33      0.12      0.23      Unaffected
Affected    1.00      0.27      0.64      Affected
Affected    0.00      0.79      0.40      Unaffected
Affected    0.32      0.12      0.22      Unaffected
Unaffected  0.00      0.70      0.35      Unaffected
Unaffected  0.00      0.75      0.38      Unaffected
Affected    0.40      0.27      0.33      Unaffected
Unaffected  0.00      0.90      0.45      Unaffected
Affected    0.71      0.78      0.75      Affected
Unaffected  0.32      0.78      0.55      Affected
Unaffected  0.40      0.27      0.33      Unaffected
Affected    0.66      0.79      0.73      Affected
Affected    0.71      0.79      0.75      Affected
Affected    0.40      0.12      0.26      Unaffected
Unaffected  0.29      0.78      0.54      Affected
Affected    1.00      0.12      0.56      Affected
Affected    0.66      0.78      0.72      Affected
Affected    1.00      0.75      0.88      Affected
Affected    0.33      0.78      0.56      Affected
Affected    1.00      0.78      0.89      Affected
Unaffected  0.33      0.27      0.30      Unaffected
Affected    0.32      0.78      0.55      Affected
Unaffected  0.33      0.27      0.30      Unaffected
Affected    0.70      0.78      0.74      Affected
Affected    0.66      0.78      0.72      Affected
Affected    0.70      0.78      0.74      Affected
Affected    1.00      0.79      0.90      Affected
Affected    1.00      0.27      0.64      Affected
Affected    1.00      0.27      0.64      Affected
Unaffected  0.00      0.27      0.14      Unaffected
Affected    1.00      0.79      0.90      Affected
Affected    0.66      0.90      0.78      Affected
Affected    1.00      0.78      0.89      Affected
Affected    0.33      0.78      0.56      Affected
Affected    1.00      0.12      0.56      Affected
Unaffected  0.66      0.27      0.47      Unaffected
Unaffected  0.69      0.78      0.74      Affected
Unaffected  0.00      0.12      0.06      Unaffected
Unaffected  0.33      0.27      0.30      Unaffected
Affected    0.29      0.27      0.28      Unaffected
Affected    0.00      0.12      0.06      Unaffected
Affected    1.00      0.75      0.88      Affected
Affected    1.00      0.27      0.64      Affected
Affected    0.36      0.90      0.63      Affected
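The averaging behind Table 8 can be sketched in a few lines of Python. The function name `simple_ensemble` is hypothetical, and the 0.5 cutoff is an assumption inferred from the tabulated labels rather than a value stated in the text. Because the tabulated probabilities are rounded to two decimals, an average recomputed from them may differ from the tabulated ensemble value in the last digit.

```python
def simple_ensemble(probabilities, threshold=0.5):
    """Average per-model probabilities of 'Affected' and classify by a cutoff."""
    mean = sum(probabilities) / len(probabilities)
    label = "Affected" if mean > threshold else "Unaffected"  # cutoff is an assumption
    return mean, label

# (clinical, selected SNPs) probabilities copied from four rows of Table 8.
rows = [(0.32, 0.27), (1.00, 0.12), (0.00, 0.79), (0.71, 0.78)]
results = [simple_ensemble(row) for row in rows]
for row, (mean, label) in zip(rows, results):
    print(row, round(mean, 3), label)
```

The function accepts any number of model probabilities, so the same sketch covers the three-way average of clinical, selected SNP, and filtered SNP predictions.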

6.3.1.4 Model-Based Ensemble Approach: Clinical Predictions and Selected SNPs Predictions

A logistic regression model is built on the hold-out datasets in the previous training step using the optimal tuning parameters. The logistic regression model is then applied to the test set. The distribution of affected/unaffected subjects based on a cross-validated model using the clinical predictions and selected SNPs predictions is presented in Table 9, where Y-axis indicates predicted affected/unaffected subjects and X-axis indicates observed affected/unaffected subjects. FIG. 17 depicts an example diagram showing a ROC curve based on clinical predictions and selected SNPs predictions using a model. Table 10 includes prediction data of the model-based ensemble approach from calculation based on a model of the model probabilities.

TABLE 9

Predicted     Observed Affected  Observed Unaffected
Affected      12                 11
Unaffected    23                 2

TABLE 10

Observed    Clinical  Selected  Model     Model Pred.
Affected    0.32      0.27      0.52      Affected
Affected    1.00      0.12      0.40      Unaffected
Affected    1.00      0.75      0.43      Unaffected
Affected    0.33      0.79      0.54      Affected
Affected    0.33      0.12      0.56      Affected
Affected    1.00      0.27      0.41      Unaffected
Affected    0.00      0.79      0.60      Affected
Affected    0.32      0.12      0.51      Affected
Unaffected  0.00      0.70      0.59      Affected
Unaffected  0.00      0.75      0.59      Affected
Affected    0.40      0.27      0.50      Affected
Unaffected  0.00      0.90      0.60      Affected
Affected    0.71      0.78      0.48      Unaffected
Unaffected  0.32      0.78      0.55      Affected
Unaffected  0.40      0.27      0.50      Affected
Affected    0.66      0.79      0.49      Unaffected
Affected    0.71      0.79      0.48      Unaffected
Affected    0.39      0.12      0.49      Unaffected
Unaffected  0.29      0.78      0.55      Affected
Affected    1.00      0.12      0.40      Unaffected
Affected    0.66      0.78      0.49      Unaffected
Affected    1.00      0.75      0.42      Unaffected
Affected    0.33      0.78      0.54      Affected
Affected    1.00      0.78      0.44      Unaffected
Unaffected  0.33      0.27      0.51      Affected
Affected    0.32      0.78      0.55      Affected
Unaffected  0.33      0.27      0.51      Affected
Affected    0.70      0.78      0.48      Unaffected
Affected    0.66      0.78      0.49      Unaffected
Affected    0.70      0.78      0.48      Unaffected
Affected    1.00      0.79      0.44      Unaffected
Affected    1.00      0.27      0.41      Unaffected
Affected    1.00      0.27      0.41      Unaffected
Unaffected  0.00      0.27      0.57      Affected
Affected    1.00      0.79      0.44      Unaffected
Affected    0.66      0.90      0.50      Unaffected
Affected    1.00      0.78      0.44      Unaffected
Affected    0.33      0.78      0.54      Affected
Affected    1.00      0.12      0.40      Unaffected
Unaffected  0.66      0.27      0.46      Unaffected
Unaffected  0.69      0.78      0.49      Unaffected
Unaffected  0.00      0.12      0.56      Affected
Unaffected  0.33      0.27      0.51      Affected
Affected    0.29      0.27      0.52      Affected
Affected    0.00      0.12      0.56      Affected
Affected    1.00      0.75      0.43      Unaffected
Affected    1.00      0.27      0.41      Unaffected
Affected    0.26      0.90      0.55      Affected
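The model-based ensemble fits a logistic regression whose inputs are the per-model predicted probabilities. The following Python sketch shows one way such a combiner could be fit by plain gradient descent; the function names, learning rate, epoch count, and the eight hold-out rows are all hypothetical and chosen only for illustration, and the fitting procedure actually used in this embodiment is not specified here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(features, outcomes, learning_rate=0.5, epochs=2000):
    """Fit an intercept and weights by batch gradient descent on the log-loss."""
    n, d = len(features), len(features[0])
    weights = [0.0] * (d + 1)  # weights[0] is the intercept
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for x, y in zip(features, outcomes):
            linear = weights[0] + sum(w * v for w, v in zip(weights[1:], x))
            error = sigmoid(linear) - y
            grad[0] += error
            for j, v in enumerate(x):
                grad[j + 1] += error * v
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad)]
    return weights

def predict(weights, x):
    """Combined probability of 'affected' for one subject's model predictions."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))

# Hypothetical hold-out predictions (clinical, selected SNPs) and outcomes (1 = affected).
holdout_predictions = [(0.9, 0.8), (0.8, 0.6), (0.7, 0.9), (0.4, 0.7),
                       (0.6, 0.3), (0.3, 0.2), (0.2, 0.4), (0.1, 0.1)]
holdout_outcomes = [1, 1, 1, 1, 0, 0, 0, 0]

combiner = fit_logistic(holdout_predictions, holdout_outcomes)
print(round(predict(combiner, (0.9, 0.9)), 3), round(predict(combiner, (0.1, 0.1)), 3))
```

Unlike the simple average, the fitted combiner can weight the constituent models unequally, and the learned weights depend entirely on the hold-out predictions supplied to it.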

6.3.1.5 Filtered SNPs

As described in Section 6.1, SNPs are identified as potentially important using a cross-validated recursive partitioning approach. In total, 4368 SNPs are identified in this process. The top 200 SNPs based on variable importance ranking are presented in Table 11.

TABLE 11

C1          C2           C3           C4           C5
rs1009701   rs9530263    rs6973851    kgp11276660  kgp4822821
rs10250195  rs978854     rs7152412    kgp11349675  kgp5013508
rs10491984  kgp12147164  rs7204751    kgp11525452  kgp5154577
rs10511271  kgp685769    rs741339     kgp11590177  kgp5322406
rs10853124  rs12186512   rs7639756    kgp1171450   kgp5666989
rs10963188  rs2131210    rs7895805    kgp11880615  kgp5781005
rs11157216  rs6619       rs7873103    kgp12025888  kgp5894054
rs11257804  rs7382146    rs834767     kgp12150323  kgp6053642
rs11898628  rs10150423   rs9333240    kgp12187497  kgp618023
rs12185748  rs10193760   rs9530262    kgp1232554   kgp6253679
rs1244791   rs10489829   rs9599040    kgp12475652  kgp6525677
rs12645173  rs10836174   rs9938218    kgp1296418   kgp6704767
rs13077887  rs10004696   rs13164140   kgp1570784   kgp6793080
rs1384681   rs110773     rs1982002    kgp16905     kgp6846921
rs1464205   rs1158602    rs6437137    kgp176973    kgp7014989
rs1478926   rs11689427   rs7204283    kgp1849314   kgp7125852
rs1567083   rs12443665   rs4858533    kgp19569551  kgp7250912
rs1707113   rs12718026   kgp4494086   kgp2122064   kgp7342509
rs1719480   rs1322231    kgp1124966   kgp2248463   kgp7455196
rs1815811   rs141748     kgp10379506  kgp22760805  kgp7596010
rs2014450   rs1551858    kgp1776634   kgp2292414   kgp7689305
rs2159766   rs1695770    kgp2206685   kgp2411366   kgp7795918
rs2291742   rs17045      kgp428005    kgp2521934   kgp7886970
rs2327338   rs17258415   kgp5395007   kgp2622558   kgp8083724
rs2646357   rs1883575    kgp8585564   kgp2665310   kgp8184668
rs2829702   rs2039241    kgp9237973   kgp2806942   kgp8294104
rs3008855   rs2072941    kgp7684428   kgp2894693   kgp834592
rs404005    rs2317951    kgp12355117  kgp3100977   kg8579484
rs4441186   rs2413786    kgp544458    kgp3195305   kgp8653740
rs4784351   rs2604913    kgp7982313   kgp3325689   kgp880912
rs4851522   rs2826121    kgp10056938  kgp345319    kgp891139
rs4976033   rs3744350    kgp10225481  kgp3486275   kgp0046560
rs6129182   rs384105     kgp10278880  kgp3680788   kgp9130134
rs6537704   rs4509646    kgp10446645  kgp3938095   kgp9488489
rs7302922   rs4912314    kgp10545153  kgp4007968   kgp955389
rs7483731   rs4953915    kgp10648217  kgp4091783   kgp97489
rs7761747   rs6025645    kgp10683029  kgp4237644   kgp9882080
rs8070676   rs6558873    kgp10838843  kgp4403116   kgp1429225
rs876964    rs6582013    kgp10923162  kgp4572570   kgp1645939
rs9396802   rs6757976    kgp1110184   kgp4679901   rs1386253

FIG. 18 depicts an example diagram showing the tuning parameter profile for the C5.0 model related to the filtered SNPs. The optimal number of trials (iterations) for this data is 46. The distribution of affected/unaffected subjects based on the filtered SNPs is presented in Table 12, where Y-axis indicates predicted affected/unaffected subjects and X-axis indicates observed affected/unaffected subjects. FIG. 19 depicts an example diagram showing a ROC curve related to the filtered SNPs. FIG. 20 depicts an example diagram showing top important predictors among the filtered SNPs.

TABLE 12

Predicted     Observed Affected  Observed Unaffected
Affected      18                 6
Unaffected    17                 7

6.3.1.6 Simple Ensemble Approach: Clinical Predictions, Selected and Filtered SNPs Predictions

A simple ensemble approach is applied to determine a simple average of test set predicted probabilities to classify subjects into the affected and unaffected categories. The distribution of affected/unaffected subjects based on the simple ensemble approach is presented in Table 13, where Y-axis indicates predicted affected/unaffected subjects and X-axis indicates observed affected/unaffected subjects.

TABLE 13

Predicted     Observed Affected  Observed Unaffected
Affected      24                 6
Unaffected    11                 7

FIG. 21 depicts an example diagram showing a ROC curve based on the average of clinical predictions and selected and filtered SNPs predictions. Table 14 includes prediction data of the simple ensemble approach from calculation based on an average of the model probabilities.

TABLE 14

Observed    Clinical  Selected  Filtered  Ensemble  Ensemble Pred
Affected    0.32      0.27      0.08      0.22      Unaffected
Affected    1.00      0.12      0.17      0.43      Unaffected
Affected    1.00      0.75      0.08      0.61      Affected
Affected    0.33      0.79      0.04      0.39      Unaffected
Affected    0.33      0.12      0.06      0.17      Unaffected
Affected    1.00      0.27      0.99      0.75      Affected
Affected    0.00      0.79      0.99      0.59      Affected
Affected    0.32      0.12      0.99      0.48      Unaffected
Unaffected  0.00      0.70      0.08      0.26      Unaffected
Unaffected  0.00      0.75      0.99      0.58      Affected
Affected    0.40      0.27      0.04      0.23      Unaffected
Unaffected  0.00      0.90      0.99      0.63      Affected
Affected    0.71      0.78      0.04      0.51      Affected
Unaffected  0.32      0.78      0.04      0.38      Unaffected
Unaffected  0.40      0.27      0.99      0.55      Affected
Affected    0.66      0.79      0.08      0.51      Affected
Affected    0.71      0.79      0.04      0.51      Affected
Affected    0.40      0.12      0.99      0.50      Affected
Unaffected  0.29      0.78      0.12      0.40      Unaffected
Affected    1.00      0.12      0.99      0.70      Affected
Affected    0.66      0.78      0.99      0.81      Affected
Affected    1.00      0.75      0.04      0.59      Affected
Affected    0.33      0.78      0.99      0.70      Affected
Affected    1.00      0.78      0.99      0.92      Affected
Unaffected  0.33      0.27      0.98      0.53      Affected
Affected    0.32      0.78      0.17      0.42      Unaffected
Unaffected  0.33      0.27      0.99      0.53      Affected
Affected    0.70      0.78      0.99      0.82      Affected
Affected    0.66      0.78      0.12      0.52      Affected
Affected    0.70      0.78      0.06      0.51      Affected
Affected    1.00      0.79      0.99      0.93      Affected
Affected    1.00      0.27      0.99      0.75      Affected
Affected    1.00      0.27      0.04      0.44      Unaffected
Unaffected  0.00      0.27      0.99      0.42      Unaffected
Affected    1.00      0.79      0.06      0.62      Affected
Affected    0.66      0.90      0.04      0.53      Affected
Affected    1.00      0.78      0.99      0.92      Affected
Affected    0.33      0.78      0.17      0.43      Unaffected
Affected    1.00      0.12      0.99      0.70      Affected
Unaffected  0.65      0.27      0.04      0.32      Unaffected
Unaffected  0.69      0.78      0.04      0.50      Affected
Unaffected  0.00      0.12      0.99      0.37      Unaffected
Unaffected  0.33      0.27      0.04      0.21      Unaffected
Affected    0.29      0.27      0.04      0.20      Unaffected
Affected    0.00      0.12      0.99      0.37      Unaffected
Affected    1.00      0.75      0.99      0.91      Affected
Affected    1.00      0.27      0.99      0.75      Affected
Affected    0.36      0.90      0.99      0.75      Affected

6.3.1.7 Model-Based Ensemble Approach: Clinical Predictions, Selected and Filtered SNPs Predictions

A logistic regression model is built on the hold-out datasets in the previous training step using the optimal tuning parameters. The logistic regression model is then applied to the test set. The distribution of affected/unaffected subjects based on a cross-validated model using the clinical predictions and selected and filtered SNPs predictions is presented in Table 15, where Y-axis indicates predicted affected/unaffected subjects and X-axis indicates observed affected/unaffected subjects. FIG. 22 depicts an example diagram showing a ROC curve based on clinical predictions and selected and filtered SNPs predictions using a model. Table 16 includes prediction data of the model-based ensemble approach from calculation based on a model of the model probabilities.

TABLE 15

Predicted     Observed Affected  Observed Unaffected
Affected      17                 7
Unaffected    18                 6

TABLE 16

Observed    Clinical  Selected  Filtered  Model     Model Pred
Affected    0.32      0.27      0.08      0.49      Unaffected
Affected    1.00      0.12      0.17      0.50      Unaffected
Affected    1.00      0.75      0.08      0.49      Unaffected
Affected    0.33      0.79      0.04      0.49      Unaffected
Affected    0.33      0.12      0.06      0.49      Unaffected
Affected    1.00      0.27      0.99      0.51      Affected
Affected    0.00      0.79      0.99      0.51      Affected
Affected    0.32      0.12      0.99      0.51      Affected
Unaffected  0.00      0.70      0.08      0.49      Unaffected
Unaffected  0.00      0.75      0.99      0.51      Affected
Affected    0.40      0.27      0.04      0.49      Unaffected
Unaffected  0.00      0.90      0.99      0.51      Affected
Affected    0.71      0.78      0.04      0.49      Unaffected
Unaffected  0.32      0.78      0.04      0.49      Unaffected
Unaffected  0.40      0.27      0.99      0.51      Affected
Affected    0.66      0.79      0.08      0.49      Unaffected
Affected    0.71      0.79      0.04      0.49      Unaffected
Affected    0.39      0.12      0.99      0.51      Affected
Unaffected  0.29      0.78      0.12      0.49      Unaffected
Affected    1.00      0.12      0.99      0.51      Affected
Affected    0.66      0.78      0.99      0.51      Affected
Affected    1.00      0.75      0.04      0.49      Unaffected
Affected    0.33      0.78      0.99      0.51      Affected
Affected    1.00      0.78      0.99      0.51      Affected
Unaffected  0.33      0.27      0.99      0.51      Affected
Affected    0.32      0.78      0.17      0.50      Unaffected
Unaffected  0.33      0.27      0.99      0.51      Affected
Affected    0.70      0.78      0.99      0.51      Affected
Affected    0.66      0.78      0.12      0.49      Unaffected
Affected    0.70      0.78      0.06      0.49      Unaffected
Affected    1.00      0.79      0.99      0.51      Affected
Affected    1.00      0.27      0.99      0.51      Affected
Affected    1.00      0.27      0.04      0.49      Unaffected
Unaffected  0.00      0.27      0.99      0.51      Affected
Affected    1.00      0.79      0.06      0.49      Unaffected
Affected    0.66      0.90      0.04      0.49      Unaffected
Affected    1.00      0.78      0.99      0.51      Affected
Affected    0.32      0.78      0.17      0.50      Unaffected
Affected    1.00      0.12      0.99      0.51      Affected
Unaffected  0.66      0.27      0.04      0.49      Unaffected
Unaffected  0.69      0.78      0.04      0.49      Unaffected
Unaffected  0.00      0.12      0.99      0.51      Affected
Unaffected  0.33      0.27      0.04      0.49      Unaffected
Affected    0.29      0.27      0.04      0.49      Unaffected
Affected    0.00      0.12      0.99      0.51      Affected
Affected    1.00      0.75      0.99      0.51      Affected
Affected    1.00      0.27      0.99      0.51      Affected
Affected    0.36      0.90      0.99      0.51      Affected

6.3.2 Random Forests

6.3.2.1 Clinical Predictors

FIG. 23 depicts an example diagram showing the tuning parameter profile for the random forests model related to clinical predictors. The optimal number of predictors randomly selected at each split for this data is 1. The distribution of affected/unaffected subjects based on the clinical predictors is presented in Table 17, where Y-axis indicates predicted affected/unaffected subjects and X-axis indicates observed affected/unaffected subjects.

TABLE 17

Predicted     Observed Affected  Observed Unaffected
Affected      20                 3
Unaffected    15                 10

FIG. 24 depicts an example diagram showing a random forest test set ROC curve related to the clinical predictors. FIG. 25 depicts an example diagram showing random forest top important predictors among the clinical predictors.

6.3.2.2 Selected SNPs

FIG. 26 depicts an example diagram showing the tuning parameter profile for the random forests model related to the selected SNPs. The optimal number of predictors randomly selected at each iteration for this data is 10. The distribution of affected/unaffected subjects based on the selected SNPs is presented in Table 18, where Y-axis indicates predicted affected/unaffected subjects and X-axis indicates observed affected/unaffected subjects.

TABLE 18

Predicted     Observed Affected  Observed Unaffected
Affected      16                 7
Unaffected    19                 6

FIG. 27 depicts an example diagram showing a random forest test set ROC curve related to the selected SNPs. FIG. 28 depicts an example diagram showing random forest top important predictors among the selected SNPs.

6.3.2.3 Simple Ensemble Approach: Clinical Predictions and Selected SNPs Predictions

A simple ensemble approach is applied to determine a simple average of test set predicted probabilities to classify subjects into the affected and unaffected categories. FIG. 29 depicts an example diagram showing a ROC curve based on the average of clinical predictions and selected SNPs predictions. The distribution of affected/unaffected subjects based on the simple ensemble approach for the random forest model is presented in Table 19, where Y-axis indicates predicted affected/unaffected subjects and X-axis indicates observed affected/unaffected subjects. Table 20 includes prediction data of the simple ensemble approach from calculation based on an average of the model probabilities for random forest models.

TABLE 19

Predicted     Observed Affected  Observed Unaffected
Affected      21                 4
Unaffected    14                 9

TABLE 20

Observed    Clinical  Selected  Ensemble  Ensemble Pred
Affected    0.36      0.39      0.38      Unaffected
Affected    0.84      0.45      0.64      Affected
Affected    0.82      0.22      0.52      Affected
Affected    0.49      0.59      0.54      Affected
Affected    0.58      0.45      0.52      Affected
Affected    0.62      0.26      0.44      Unaffected
Affected    0.33      0.49      0.41      Unaffected
Affected    0.38      0.10      0.24      Unaffected
Unaffected  0.26      0.72      0.49      Unaffected
Unaffected  0.26      0.56      0.41      Unaffected
Affected    0.39      0.49      0.44      Unaffected
Unaffected  0.28      0.62      0.46      Unaffected
Affected    0.36      0.55      0.46      Unaffected
Unaffected  0.42      0.51      0.46      Unaffected
Unaffected  0.39      0.33      0.36      Unaffected
Affected    0.69      0.76      0.72      Affected
Affected    0.42      0.31      0.36      Unaffected
Affected    0.36      0.38      0.37      Unaffected
Unaffected  0.34      0.79      0.56      Affected
Affected    0.73      0.43      0.58      Affected
Affected    0.73      0.79      0.76      Affected
Affected    0.73      0.57      0.65      Affected
Affected    0.49      0.59      0.54      Affected
Affected    0.57      0.61      0.59      Affected
Unaffected  0.55      0.69      0.62      Affected
Affected    0.34      0.71      0.53      Affected
Unaffected  0.48      0.40      0.44      Unaffected
Affected    0.46      0.54      0.50      Unaffected
Affected    0.66      0.81      0.74      Affected
Affected    0.56      0.41      0.48      Unaffected
Affected    0.46      0.33      0.39      Unaffected
Affected    0.72      0.24      0.48      Unaffected
Affected    0.83      0.73      0.78      Affected
Unaffected  0.44      0.30      0.37      Unaffected
Affected    0.57      0.84      0.71      Affected
Affected    0.67      0.78      0.72      Affected
Affected    0.74      0.66      0.70      Affected
Affected    0.49      0.90      0.70      Affected
Affected    0.72      0.39      0.56      Affected
Unaffected  0.65      0.23      0.44      Unaffected
Unaffected  0.42      0.64      0.53      Affected
Unaffected  0.38      0.50      0.44      Unaffected
Unaffected  0.58      0.47      0.53      Affected
Affected    0.34      0.47      0.40      Unaffected
Affected    0.38      0.48      0.43      Unaffected
Affected    0.68      0.56      0.62      Affected
Affected    0.83      0.35      0.59      Affected
Affected    0.56      0.47      0.52      Affected

6.3.2.4 Model-Based Ensemble Approach: Clinical Predictions and Selected SNPs Predictions

A logistic regression model is built on the hold-out datasets in the previous training step using the optimal tuning parameters. The logistic regression model is then applied to the test set. The distribution of affected/unaffected subjects based on a cross-validated model using the clinical predictions and selected SNPs predictions from random forest models is presented in Table 21, where Y-axis indicates predicted affected/unaffected subjects and X-axis indicates observed affected/unaffected subjects. FIG. 30 depicts an example diagram showing a ROC curve based on clinical predictions and selected SNPs predictions using a model based on the random forest models. Table 22 includes prediction data of the model-based ensemble approach from calculation based on a model of the model probabilities for the random forest models.

TABLE 21

Predicted     Observed Affected  Observed Unaffected
Affected      12                 10
Unaffected    23                 3

TABLE 22

Observed    Clinical  Selected  Model     Model Pred
Affected    0.36      0.39      0.56      Affected
Affected    0.84      0.44      0.40      Unaffected
Affected    0.82      0.22      0.43      Unaffected
Affected    0.49      0.59      0.50      Unaffected
Affected    0.58      0.45      0.48      Unaffected
Affected    0.62      0.26      0.49      Unaffected
Affected    0.33      0.50      0.56      Affected
Affected    0.38      0.10      0.59      Affected
Unaffected  0.26      0.72      0.56      Affected
Unaffected  0.26      0.56      0.58      Affected
Affected    0.39      0.50      0.54      Affected
Unaffected  0.28      0.62      0.56      Affected
Affected    0.36      0.55      0.54      Affected
Unaffected  0.42      0.50      0.53      Affected
Unaffected  0.38      0.33      0.56      Affected
Affected    0.70      0.76      0.41      Unaffected
Affected    0.42      0.31      0.55      Affected
Affected    0.36      0.38      0.56      Affected
Unaffected  0.34      0.78      0.52      Affected
Affected    0.74      0.43      0.43      Unaffected
Affected    0.74      0.78      0.39      Unaffected
Affected    0.73      0.57      0.42      Unaffected
Affected    0.49      0.59      0.50      Unaffected
Affected    0.57      0.62      0.46      Unaffected
Unaffected  0.55      0.69      0.46      Unaffected
Affected    0.34      0.71      0.53      Affected
Unaffected  0.48      0.40      0.52      Affected
Affected    0.46      0.54      0.51      Affected
Affected    0.66      0.82      0.41      Unaffected
Affected    0.56      0.41      0.50      Unaffected
Affected    0.46      0.32      0.54      Affected
Affected    0.72      0.24      0.46      Unaffected
Affected    0.84      0.73      0.37      Unaffected
Unaffected  0.44      0.30      0.55      Affected
Affected    0.57      0.84      0.44      Unaffected
Affected    0.67      0.78      0.41      Unaffected
Affected    0.74      0.66      0.41      Unaffected
Affected    0.50      0.90      0.46      Unaffected
Affected    0.72      0.39      0.44      Unaffected
Unaffected  0.65      0.23      0.49      Unaffected
Unaffected  0.42      0.64      0.51      Affected
Unaffected  0.38      0.50      0.55      Affected
Unaffected  0.58      0.47      0.48      Unaffected
Affected    0.34      0.47      0.56      Affected
Affected    0.38      0.48      0.55      Affected
Affected    0.68      0.56      0.44      Unaffected
Affected    0.84      0.36      0.41      Unaffected
Affected    0.56      0.48      0.49      Unaffected

6.3.2.5 Filtered SNPs

FIG. 31 depicts an example diagram showing the tuning parameter profile for the random forest model related to the filtered SNPs. The cross-validated ROC is high in FIG. 31, indicating that the model may be over-fitting, likely due to a high proportion of irrelevant predictors. The optimal number of trials (iterations) for this data is 10. The distribution of affected/unaffected subjects based on the filtered SNPs is presented in Table 23, where Y-axis indicates predicted affected/unaffected subjects and X-axis indicates observed affected/unaffected subjects. FIG. 32 depicts an example diagram showing a random forest test set ROC curve related to the filtered SNPs. FIG. 33 depicts an example diagram showing random forest top important predictors among the filtered SNPs.

TABLE 23

Predicted     Observed Affected  Observed Unaffected
Affected      25                 12
Unaffected    10                 1

6.3.2.6 Simple Ensemble Approach: Clinical Predictions, Selected and Filtered SNPs Predictions

A simple ensemble approach is applied to determine a simple average of test set predicted probabilities to classify subjects into the affected and unaffected categories. The distribution of affected/unaffected subjects based on the average of clinical predictions and selected and filtered SNPs predictions for random forest models is presented in Table 24, where Y-axis indicates predicted affected/unaffected subjects and X-axis indicates observed affected/unaffected subjects.

TABLE 24

Predicted     Observed Affected  Observed Unaffected
Affected      23                 7
Unaffected    12                 6

FIG. 34 depicts an example diagram showing a ROC curve based on the average of clinical predictions and selected and filtered SNPs predictions for random forest models. Table 25 includes prediction data of the simple ensemble approach from calculation based on an average of the model probabilities.

TABLE 25

Observed    Clinical  Selected  Filtered  Ensemble  Ensemble Pred
Affected    0.36      0.39      0.46      0.40      Unaffected
Affected    0.84      0.45      0.57      0.62      Affected
Affected    0.82      0.22      0.48      0.51      Affected
Affected    0.49      0.59      0.46      0.51      Affected
Affected    0.58      0.45      0.50      0.51      Affected
Affected    0.62      0.26      0.59      0.49      Unaffected
Affected    0.33      0.49      0.53      0.45      Unaffected
Affected    0.38      0.10      0.58      0.35      Unaffected
Unaffected  0.26      0.72      0.51      0.49      Unaffected
Unaffected  0.26      0.56      0.51      0.44      Unaffected
Affected    0.39      0.49      0.57      0.48      Unaffected
Unaffected  0.28      0.62      0.64      0.52      Affected
Affected    0.36      0.55      0.52      0.47      Unaffected
Unaffected  0.42      0.51      0.54      0.49      Unaffected
Unaffected  0.39      0.33      0.46      0.39      Unaffected
Affected    0.69      0.76      0.49      0.65      Affected
Affected    0.42      0.31      0.50      0.41      Unaffected
Affected    0.36      0.38      0.47      0.40      Unaffected
Unaffected  0.34      0.79      0.46      0.53      Affected
Affected    0.72      0.43      0.65      0.60      Affected
Affected    0.73      0.79      0.60      0.71      Affected
Affected    0.73      0.57      0.55      0.61      Affected
Affected    0.49      0.59      0.68      0.59      Affected
Affected    0.57      0.61      0.58      0.59      Affected
Unaffected  0.55      0.69      0.54      0.59      Affected
Affected    0.34      0.71      0.59      0.55      Affected
Unaffected  0.48      0.40      0.64      0.51      Affected
Affected    0.46      0.54      0.64      0.54      Affected
Affected    0.66      0.81      0.64      0.70      Affected
Affected    0.56      0.41      0.48      0.48      Unaffected
Affected    0.46      0.33      0.54      0.44      Unaffected
Affected    0.72      0.24      0.67      0.54      Affected
Affected    0.82      0.73      0.47      0.68      Affected
Unaffected  0.44      0.30      0.60      0.45      Unaffected
Affected    0.57      0.84      0.56      0.66      Affected
Affected    0.67      0.78      0.53      0.66      Affected
Affected    0.74      0.66      0.63      0.68      Affected
Affected    0.49      0.90      0.55      0.65      Affected
Affected    0.72      0.39      0.65      0.59      Affected
Unaffected  0.65      0.23      0.53      0.47      Unaffected
Unaffected  0.42      0.64      0.53      0.53      Affected
Unaffected  0.38      0.50      0.66      0.51      Affected
Unaffected  0.58      0.47      0.56      0.54      Affected
Affected    0.34      0.47      0.51      0.44      Unaffected
Affected    0.38      0.48      0.62      0.49      Unaffected
Affected    0.68      0.56      0.46      0.57      Affected
Affected    0.83      0.35      0.50      0.56      Affected
Affected    0.56      0.47      0.50      0.51      Affected

6.3.2.7 Model-Based Ensemble Approach: Clinical Predictions, Selected and Filtered SNPs Predictions

A logistic regression model is built on the hold-out predictions in the previous training step using the optimal tuning parameters. The logistic regression model is then applied to the test set. The distribution of affected/unaffected subjects based on a cross-validated model using the clinical predictions and selected and filtered SNPs predictions for the random forest models is presented in Table 26, where Y-axis indicates predicted affected/unaffected subjects and X-axis indicates observed affected/unaffected subjects. FIG. 35 depicts an example diagram showing a ROC curve based on clinical predictions and selected and filtered SNPs predictions using a model for the random forest models. Table 27 includes prediction data of the model-based ensemble approach from calculation based on a model of the model probabilities for the random forest models.

TABLE 26

Predicted     Observed Affected  Observed Unaffected
Affected      20                 9
Unaffected    15                 4

TABLE 27

Observed    Clinical  Selected  Filtered  Model     Model Pred
Affected    0.36      0.39      0.46      0.00      Unaffected
Affected    0.84      0.45      0.57      1.00      Affected
Affected    0.82      0.22      0.48      0.00      Unaffected
Affected    0.49      0.59      0.46      0.00      Unaffected
Affected    0.58      0.45      0.50      0.00      Unaffected
Affected    0.62      0.26      0.59      1.00      Affected
Affected    0.33      0.50      0.53      0.01      Unaffected
Affected    0.38      0.10      0.58      1.00      Affected
Unaffected  0.26      0.72      0.50      0.00      Unaffected
Unaffected  0.26      0.56      0.51      0.00      Unaffected
Affected    0.39      0.50      0.57      1.00      Affected
Unaffected  0.28      0.62      0.64      1.00      Affected
Affected    0.36      0.55      0.52      0.00      Unaffected
Unaffected  0.42      0.50      0.54      1.00      Affected
Unaffected  0.38      0.33      0.46      0.00      Unaffected
Affected    0.70      0.76      0.50      0.00      Unaffected
Affected    0.42      0.31      0.50      0.00      Unaffected
Affected    0.36      0.38      0.46      0.00      Unaffected
Unaffected  0.34      0.78      0.46      0.00      Unaffected
Affected    0.74      0.43      0.65      1.00      Affected
Affected    0.74      0.78      0.60      1.00      Affected
Affected    0.73      0.57      0.55      1.00      Affected
Affected    0.49      0.59      0.68      1.00      Affected
Affected    0.57      0.62      0.58      1.00      Affected
Unaffected  0.55      0.69      0.54      1.00      Affected
Affected    0.34      0.71      0.59      1.00      Affected
Unaffected  0.48      0.40      0.64      1.00      Affected
Affected    0.46      0.54      0.64      1.00      Affected
Affected    0.66      0.82      0.64      1.00      Affected
Affected    0.56      0.41      0.48      0.00      Unaffected
Affected    0.46      0.32      0.54      1.00      Affected
Affected    0.72      0.24      0.66      1.00      Affected
Affected    0.84      0.73      0.47      0.00      Unaffected
Unaffected  0.44      0.30      0.60      1.00      Affected
Affected    0.57      0.84      0.56      1.00      Affected
Affected    0.67      0.78      0.52      1.00      Affected
Affected    0.74      0.66      0.63      1.00      Affected
Affected    0.50      0.90      0.55      1.00      Affected
Affected    0.72      0.39      0.65      1.00      Affected
Unaffected  0.65      0.23      0.53      1.00      Affected
Unaffected  0.42      0.64      0.53      0.67      Affected
Unaffected  0.38      0.50      0.66      1.00      Affected
Unaffected  0.58      0.47      0.56      1.00      Affected
Affected    0.34      0.47      0.51      0.00      Unaffected
Affected    0.38      0.48      0.62      1.00      Affected
Affected    0.68      0.56      0.46      0.00      Unaffected
Affected    0.84      0.36      0.50      0.00      Unaffected
Affected    0.56      0.48      0.50      0.00      Unaffected

6.3.3 Penalized Logistic Regression

6.3.3.1 Clinical Predictors

FIG. 36 depicts an example diagram showing the tuning parameter profile for the penalized logistic model related to clinical predictors. The optimal mixing percentage for this data is 0.1 and the optimal regularization parameter is 0.3. The distribution of affected/unaffected subjects for the penalized logistic regression model is presented in Table 28, where Y-axis indicates predicted affected/unaffected subjects and X-axis indicates observed affected/unaffected subjects.

TABLE 28
Predicted \ Observed    Affected    Unaffected
Affected                21          2
Unaffected              14          11
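For reference, summary rates can be derived from a 2x2 table such as Table 28. The sketch below is illustrative only: it assumes "Affected" is the positive class and that rows are predicted and columns are observed, as stated above. The function name is hypothetical.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Return (sensitivity, specificity, accuracy) for a 2x2 table."""
    sensitivity = tp / (tp + fn)  # observed affected that were predicted affected
    specificity = tn / (tn + fp)  # observed unaffected that were predicted unaffected
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return sensitivity, specificity, accuracy

# Table 28 counts: predicted Affected row = (21, 2), predicted Unaffected row = (14, 11).
sens, spec, acc = confusion_metrics(tp=21, fp=2, fn=14, tn=11)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} accuracy={acc:.2f}")
# prints: sensitivity=0.60 specificity=0.85 accuracy=0.67
```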

FIG. 37 depicts an example diagram showing a penalized logistic regression test set ROC curve related to the clinical predictors. FIG. 38 depicts an example diagram showing penalized logistic regression top important predictors among the clinical predictors.
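A test set ROC curve is commonly summarized by its area under the curve (AUC). The sketch below computes AUC with the rank-based (Mann-Whitney) formulation; it is a generic illustration, not the procedure used to produce the figures, and the labels and scores shown are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (affected, unaffected) pairs ranked correctly.

    labels: 1 for affected, 0 for unaffected. Tied scores count as half.
    """
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical predicted probabilities for four test subjects.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2])
print(auc)  # prints 0.75
```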

6.3.3.2 Selected SNPs

FIG. 39 depicts an example diagram showing the tuning parameter profile for the penalized logistic regression model related to the selected SNPs. The optimal mixing percentage for this data is 0.2 and the optimal regularization parameter is 0.01. The distribution of affected/unaffected subjects based on the selected SNPs for the penalized logistic regression model is presented in Table 29, where the Y-axis (rows) indicates predicted affected/unaffected subjects and the X-axis (columns) indicates observed affected/unaffected subjects.

TABLE 29
Predicted \ Observed    Affected    Unaffected
Affected                19          10
Unaffected              16          3

FIG. 40 depicts an example diagram showing a penalized logistic regression test set ROC curve related to the selected SNPs. FIG. 41 depicts an example diagram showing penalized logistic regression top important predictors among the selected SNPs.

6.3.3.3 Simple Ensemble Approach: Clinical Predictions and Selected SNPs Predictions

A simple ensemble approach is applied to determine a simple average of test set predicted probabilities to classify subjects into the affected and unaffected categories. FIG. 42 depicts an example diagram showing a ROC curve based on the average of clinical predictions and selected SNPs predictions for the penalized logistic regression model. The distribution of affected/unaffected subjects based on the simple ensemble approach for the penalized logistic regression model is presented in Table 30, where the Y-axis (rows) indicates predicted affected/unaffected subjects and the X-axis (columns) indicates observed affected/unaffected subjects. Table 31 includes prediction data of the simple ensemble approach, calculated as an average of the model probabilities for the penalized logistic regression models.

TABLE 30
Predicted \ Observed    Affected    Unaffected
Affected                21          7
Unaffected              14          6

TABLE 31
Observed    Clinical  Selected  Ensemble  Ensemble Pred
Affected    0.46      0.47      0.47      Unaffected
Affected    0.57      0.60      0.58      Affected
Affected    0.52      0.23      0.37      Unaffected
Affected    0.51      0.49      0.50      Affected
Affected    0.47      0.48      0.47      Unaffected
Affected    0.55      0.42      0.48      Unaffected
Affected    0.46      0.18      0.32      Unaffected
Affected    0.46      0.54      0.50      Unaffected
Unaffected  0.34      0.51      0.42      Unaffected
Unaffected  0.33      0.58      0.46      Unaffected
Affected    0.48      0.47      0.48      Unaffected
Unaffected  0.44      0.51      0.47      Unaffected
Affected    0.48      0.68      0.58      Affected
Unaffected  0.47      0.06      0.27      Unaffected
Unaffected  0.47      0.53      0.50      Affected
Affected    0.51      0.43      0.47      Unaffected
Affected    0.50      0.23      0.37      Unaffected
Affected    0.44      0.22      0.33      Unaffected
Unaffected  0.44      0.64      0.54      Affected
Affected    0.57      0.54      0.55      Affected
Affected    0.50      0.54      0.52      Affected
Affected    0.58      0.47      0.52      Affected
Affected    0.51      0.56      0.53      Affected
Affected    0.51      0.71      0.61      Affected
Unaffected  0.46      0.61      0.53      Affected
Affected    0.50      0.52      0.51      Affected
Unaffected  0.51      0.56      0.53      Affected
Affected    0.44      0.65      0.54      Affected
Affected    0.53      0.59      0.56      Affected
Affected    0.51      0.59      0.55      Affected
Affected    0.48      0.22      0.35      Unaffected
Affected    0.57      0.47      0.52      Affected
Affected    0.56      0.56      0.56      Affected
Unaffected  0.51      0.46      0.49      Unaffected
Affected    0.48      0.57      0.53      Affected
Affected    0.52      0.59      0.55      Affected
Affected    0.56      0.58      0.57      Affected
Affected    0.50      0.62      0.56      Affected
Affected    0.55      0.20      0.38      Unaffected
Unaffected  0.50      0.17      0.33      Unaffected
Unaffected  0.48      0.58      0.53      Affected
Unaffected  0.45      0.64      0.54      Affected
Unaffected  0.45      0.61      0.53      Affected
Affected    0.48      0.61      0.54      Affected
Affected    0.46      0.63      0.54      Affected
Affected    0.49      0.42      0.46      Unaffected
Affected    0.51      0.42      0.46      Unaffected
Affected    0.58      0.57      0.58      Affected
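The averaging step of the simple ensemble approach can be sketched as follows. This is an illustrative sketch rather than the disclosed implementation: the function name `simple_ensemble`, the 0.5 cutoff, and the `>=` tie rule are assumptions. The three sample pairs are the Clinical and Selected values from three rows of Table 31.

```python
def simple_ensemble(prob_lists, threshold=0.5):
    """Average per-model probabilities for each subject and classify.

    prob_lists: one list of predicted probabilities per base model,
    all in the same subject order. The threshold is an assumed cutoff.
    """
    results = []
    for probs in zip(*prob_lists):
        mean = sum(probs) / len(probs)
        label = "Affected" if mean >= threshold else "Unaffected"
        results.append((mean, label))
    return results

# Clinical and Selected probabilities taken from three rows of Table 31.
clinical = [0.57, 0.52, 0.47]
selected = [0.60, 0.23, 0.06]

ensemble = simple_ensemble([clinical, selected])
print([label for _, label in ensemble])
# prints: ['Affected', 'Unaffected', 'Unaffected']
```

Rows whose displayed ensemble value is 0.50 appear with both labels in Table 31, which suggests that classification was performed on unrounded probabilities; the cutoff above is therefore only an approximation when applied to the rounded table values. The same function accepts three lists when the filtered SNPs predictions are also averaged.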

6.3.3.4 Model-Based Ensemble Approach: Clinical Predictions and Selected SNPs Predictions

A logistic regression model is built on the hold-out datasets in the previous training step using the optimal tuning parameters. The logistic regression model is then applied to the test set. The distribution of affected/unaffected subjects based on a cross-validated model using the clinical predictions and selected SNPs predictions from penalized logistic regression models is presented in Table 32, where the Y-axis (rows) indicates predicted affected/unaffected subjects and the X-axis (columns) indicates observed affected/unaffected subjects. FIG. 43 depicts an example diagram showing a ROC curve based on clinical predictions and selected SNPs predictions using a model based on the penalized logistic regression models. Table 33 includes prediction data of the model-based ensemble approach, calculated from a model of the model probabilities for the penalized logistic regression models.

TABLE 32
Predicted \ Observed    Affected    Unaffected
Affected                15          11
Unaffected              20          2

TABLE 33
Observed    Clinical  Selected  Model  Model Pred
Affected    0.46      0.47      0.60   Affected
Affected    0.57      0.60      0.33   Unaffected
Affected    0.52      0.23      0.64   Affected
Affected    0.51      0.49      0.50   Unaffected
Affected    0.47      0.48      0.58   Affected
Affected    0.55      0.42      0.48   Unaffected
Affected    0.46      0.18      0.76   Affected
Affected    0.46      0.54      0.56   Affected
Unaffected  0.34      0.51      0.77   Affected
Unaffected  0.33      0.58      0.74   Affected
Affected    0.48      0.48      0.57   Affected
Unaffected  0.44      0.50      0.61   Affected
Affected    0.48      0.68      0.43   Unaffected
Unaffected  0.47      0.06      0.79   Affected
Unaffected  0.47      0.53      0.54   Affected
Affected    0.51      0.43      0.54   Affected
Affected    0.50      0.23      0.66   Affected
Affected    0.44      0.22      0.76   Affected
Unaffected  0.44      0.64      0.53   Affected
Affected    0.57      0.54      0.37   Unaffected
Affected    0.50      0.54      0.47   Unaffected
Affected    0.58      0.47      0.39   Unaffected
Affected    0.51      0.56      0.46   Unaffected
Affected    0.51      0.71      0.36   Unaffected
Unaffected  0.46      0.61      0.52   Affected
Affected    0.50      0.52      0.49   Unaffected
Unaffected  0.51      0.56      0.45   Unaffected
Affected    0.44      0.65      0.53   Affected
Affected    0.53      0.59      0.41   Unaffected
Affected    0.51      0.59      0.44   Unaffected
Affected    0.48      0.22      0.71   Affected
Affected    0.57      0.47      0.40   Unaffected
Affected    0.56      0.57      0.36   Unaffected
Unaffected  0.51      0.46      0.51   Affected
Affected    0.48      0.58      0.49   Unaffected
Affected    0.52      0.59      0.42   Unaffected
Affected    0.56      0.58      0.36   Unaffected
Affected    0.50      0.62      0.44   Unaffected
Affected    0.55      0.20      0.60   Affected
Unaffected  0.50      0.17      0.70   Affected
Unaffected  0.48      0.58      0.49   Unaffected
Unaffected  0.45      0.64      0.51   Affected
Unaffected  0.45      0.61      0.54   Affected
Affected    0.48      0.61      0.48   Unaffected
Affected    0.45      0.63      0.51   Affected
Affected    0.50      0.42      0.56   Affected
Affected    0.51      0.42      0.55   Affected
Affected    0.58      0.58      0.32   Unaffected
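The model-based ensemble described in this section resembles stacking: a second-level logistic regression is fit on hold-out predictions from the base models and then applied to test set predictions. The sketch below illustrates that idea under stated assumptions. It uses plain batch gradient descent, hypothetical hold-out values, and hypothetical function names; the fitting routine actually used is not specified here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit an unpenalized logistic regression by batch gradient descent.

    Returns weights w, where w[0] is the intercept.
    """
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for row, label in zip(X, y):
            z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], row))
            err = sigmoid(z) - label
            grad[0] += err
            for j, xi in enumerate(row):
                grad[j + 1] += err * xi
        w = [wi - lr * g / len(y) for wi, g in zip(w, grad)]
    return w

def predict_proba(w, row):
    """Second-level probability of 'affected' for one subject."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], row)))

# Hypothetical hold-out predictions: [clinical model, selected-SNPs model].
holdout = [[0.8, 0.7], [0.7, 0.9], [0.6, 0.8],
           [0.3, 0.2], [0.2, 0.4], [0.4, 0.1]]
observed = [1, 1, 1, 0, 0, 0]  # 1 = affected, 0 = unaffected

meta = fit_logistic(holdout, observed)

# Apply the second-level model to hypothetical test set predictions.
test_probs = [predict_proba(meta, row) for row in [[0.9, 0.8], [0.1, 0.2]]]
```

Unlike the simple average, the second-level model learns its own weights for each base model, so its output need not lie between the base probabilities.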

6.3.3.5 Filtered SNPs

FIG. 44 depicts an example diagram showing the tuning parameter profile for the penalized logistic regression model related to the filtered SNPs. The cross-validated ROC is high in FIG. 44, indicating that the model may be over-fitting, likely due to a high proportion of irrelevant predictors. The optimal mixing percentage for this data is 0.05 and the optimal regularization parameter is 0.1. The distribution of affected/unaffected subjects based on the filtered SNPs for the penalized logistic regression model is presented in Table 34, where the Y-axis (rows) indicates predicted affected/unaffected subjects and the X-axis (columns) indicates observed affected/unaffected subjects. FIG. 45 depicts an example diagram showing a penalized logistic regression test set ROC curve related to the filtered SNPs. FIG. 46 depicts an example diagram showing penalized logistic regression top important predictors among the filtered SNPs.

TABLE 34
Predicted \ Observed    Affected    Unaffected
Affected                26          10
Unaffected              9           3

6.3.3.6 Simple Ensemble Approach: Clinical Predictions, Selected and Filtered SNPs Predictions

A simple ensemble approach is applied to determine a simple average of test set predicted probabilities to classify subjects into the affected and unaffected categories. The distribution of affected/unaffected subjects based on the average of clinical predictions and selected and filtered SNPs predictions for penalized logistic regression models is presented in Table 35, where the Y-axis (rows) indicates predicted affected/unaffected subjects and the X-axis (columns) indicates observed affected/unaffected subjects.

TABLE 35
Predicted \ Observed    Affected    Unaffected
Affected                27          9
Unaffected              8           4

FIG. 47 depicts an example diagram showing a ROC curve based on the average of clinical predictions and selected and filtered SNPs predictions for penalized logistic regression models. Table 36 includes prediction data of the simple ensemble approach from calculation based on an average of the model probabilities for penalized logistic regression models.

TABLE 36
Observed    Clinical  Selected  Filtered  Ensemble  Ensemble Pred
Affected    0.46      0.47      0.63      0.52      Affected
Affected    0.57      0.60      0.86      0.68      Affected
Affected    0.52      0.23      0.39      0.38      Unaffected
Affected    0.51      0.49      0.54      0.52      Affected
Affected    0.47      0.48      0.35      0.43      Unaffected
Affected    0.55      0.42      0.75      0.57      Affected
Affected    0.46      0.18      0.59      0.41      Unaffected
Affected    0.46      0.54      0.81      0.60      Affected
Unaffected  0.34      0.51      0.60      0.48      Unaffected
Unaffected  0.33      0.58      0.60      0.51      Affected
Affected    0.48      0.47      0.57      0.51      Affected
Unaffected  0.44      0.51      0.89      0.61      Affected
Affected    0.48      0.68      0.72      0.63      Affected
Unaffected  0.47      0.06      0.39      0.31      Unaffected
Unaffected  0.47      0.53      0.33      0.44      Unaffected
Affected    0.51      0.43      0.41      0.45      Unaffected
Affected    0.50      0.23      0.64      0.46      Unaffected
Affected    0.44      0.22      0.55      0.40      Unaffected
Unaffected  0.44      0.64      0.47      0.52      Affected
Affected    0.57      0.54      0.91      0.67      Affected
Affected    0.50      0.54      0.92      0.66      Affected
Affected    0.58      0.47      0.75      0.60      Affected
Affected    0.51      0.56      0.96      0.68      Affected
Affected    0.51      0.71      0.95      0.73      Affected
Unaffected  0.46      0.61      0.47      0.51      Affected
Affected    0.50      0.52      0.74      0.59      Affected
Unaffected  0.51      0.56      0.62      0.56      Affected
Affected    0.44      0.65      0.83      0.64      Affected
Affected    0.53      0.59      0.89      0.67      Affected
Affected    0.51      0.59      0.54      0.55      Affected
Affected    0.48      0.22      0.86      0.52      Affected
Affected    0.57      0.47      0.74      0.59      Affected
Affected    0.56      0.56      0.41      0.51      Affected
Unaffected  0.51      0.46      0.60      0.52      Affected
Affected    0.48      0.57      0.49      0.52      Affected
Affected    0.52      0.59      0.39      0.50      Affected
Affected    0.56      0.58      0.96      0.70      Affected
Affected    0.50      0.62      0.82      0.64      Affected
Affected    0.55      0.20      0.76      0.50      Affected
Unaffected  0.50      0.17      0.71      0.46      Unaffected
Unaffected  0.48      0.58      0.60      0.56      Affected
Unaffected  0.45      0.64      0.90      0.66      Affected
Unaffected  0.45      0.61      0.50      0.52      Affected
Affected    0.48      0.61      0.62      0.57      Affected
Affected    0.56      0.63      0.89      0.66      Affected
Affected    0.49      0.42      0.76      0.56      Affected
Affected    0.51      0.42      0.57      0.50      Unaffected
Affected    0.58      0.57      0.20      0.45      Unaffected

6.3.3.7 Model-Based Ensemble Approach: Clinical Predictions, Selected and Filtered SNPs Predictions

A logistic regression model is built on the hold-out datasets in the previous training step using the optimal tuning parameters. The logistic regression model is then applied to the test set. The distribution of affected/unaffected subjects based on a cross-validated model using the clinical predictions and selected and filtered SNPs predictions for the penalized logistic regression models is presented in Table 37, where the Y-axis (rows) indicates predicted affected/unaffected subjects and the X-axis (columns) indicates observed affected/unaffected subjects. FIG. 48 depicts an example diagram showing a ROC curve based on clinical predictions and selected and filtered SNPs predictions using a model for the penalized logistic regression models. Table 38 includes prediction data of the model-based ensemble approach, calculated from a model of the model probabilities for the penalized logistic regression models.

TABLE 37
Predicted \ Observed    Affected    Unaffected
Affected                28          8
Unaffected              7           5

TABLE 38
Observed    Clinical  Selected  Filtered  Model  Model Pred
Affected    0.46      0.47      0.63      0.51   Affected
Affected    0.57      0.60      0.86      0.51   Affected
Affected    0.52      0.23      0.39      0.50   Unaffected
Affected    0.51      0.49      0.54      0.50   Affected
Affected    0.47      0.48      0.35      0.49   Unaffected
Affected    0.55      0.42      0.75      0.51   Affected
Affected    0.46      0.18      0.59      0.50   Affected
Affected    0.46      0.54      0.81      0.51   Affected
Unaffected  0.34      0.51      0.60      0.50   Affected
Unaffected  0.33      0.58      0.60      0.50   Affected
Affected    0.48      0.48      0.57      0.50   Affected
Unaffected  0.44      0.50      0.88      0.62   Affected
Affected    0.48      0.68      0.72      0.51   Affected
Unaffected  0.47      0.06      0.39      0.50   Unaffected
Unaffected  0.47      0.53      0.33      0.49   Unaffected
Affected    0.51      0.43      0.41      0.50   Unaffected
Affected    0.50      0.23      0.64      0.51   Affected
Affected    0.44      0.22      0.55      0.50   Affected
Unaffected  0.44      0.64      0.47      0.50   Unaffected
Affected    0.57      0.54      0.91      0.52   Affected
Affected    0.50      0.54      0.92      0.52   Affected
Affected    0.58      0.47      0.75      0.51   Affected
Affected    0.51      0.56      0.96      0.52   Affected
Affected    0.51      0.71      0.95      0.52   Affected
Unaffected  0.46      0.61      0.47      0.50   Unaffected
Affected    0.50      0.52      0.74      0.51   Affected
Unaffected  0.51      0.56      0.62      0.50   Affected
Affected    0.44      0.65      0.83      0.51   Affected
Affected    0.53      0.59      0.89      0.52   Affected
Affected    0.51      0.59      0.54      0.50   Affected
Affected    0.48      0.22      0.86      0.51   Affected
Affected    0.57      0.47      0.74      0.51   Affected
Affected    0.56      0.57      0.41      0.50   Unaffected
Unaffected  0.51      0.46      0.60      0.50   Affected
Affected    0.48      0.58      0.49      0.50   Unaffected
Affected    0.52      0.59      0.39      0.50   Unaffected
Affected    0.56      0.58      0.96      0.52   Affected
Affected    0.50      0.62      0.82      0.51   Affected
Affected    0.55      0.20      0.76      0.51   Affected
Unaffected  0.50      0.17      0.71      0.51   Affected
Unaffected  0.48      0.58      0.60      0.50   Affected
Unaffected  0.45      0.64      0.90      0.52   Affected
Unaffected  0.45      0.61      0.50      0.50   Unaffected
Affected    0.48      0.61      0.62      0.50   Affected
Affected    0.45      0.63      0.89      0.52   Affected
Affected    0.50      0.42      0.76      0.51   Affected
Affected    0.51      0.42      0.57      0.50   Affected
Affected    0.58      0.58      0.20      0.49   Unaffected

It should be understood that the above description discloses only several scenarios presented by this invention. Although the description is relatively specific and detailed, it should not be understood as limiting the scope of the invention. It should be noted that persons of ordinary skill in the art may, without departing from the concept of the invention, make a number of variations and modifications, all of which are within the scope of this invention. Accordingly, the scope of protection is defined by the appended claims.

For example, some or all components of various embodiments of the present invention each are, individually and/or in combination with at least another component, implemented using one or more software components, one or more hardware components, and/or one or more combinations of software and hardware components. In another example, some or all components of various embodiments of the present invention each are, individually and/or in combination with at least another component, implemented in one or more circuits, such as one or more analog circuits and/or one or more digital circuits. In yet another example, various embodiments and/or examples of the present invention can be combined.

Additionally, the methods and systems described herein may be implemented on many different types of processing devices by program code comprising program instructions that are executable by the device processing subsystem. The software program instructions may include source code, object code, machine code, or any other stored data that is operable to cause a processing system to perform the methods and operations described herein. Other implementations may also be used, however, such as firmware or even appropriately designed hardware configured to carry out the methods and systems described herein.

The systems' and methods' data (e.g., associations, mappings, data input, data output, intermediate data results, final data results, etc.) may be stored and implemented in one or more different types of computer-implemented data stores, such as different types of storage devices and programming constructs (e.g., RAM, ROM, Flash memory, flat files, databases, programming data structures, programming variables, IF-THEN (or similar type) statement constructs, etc.). It is noted that data structures describe formats for use in organizing and storing data in databases, programs, memory, or other computer-readable media for use by a computer program.

The systems and methods may be provided on many different types of computer-readable media including computer storage mechanisms (e.g., CD-ROM, diskette, RAM, flash memory, computer's hard drive, etc.) that contain instructions (e.g., software) for use in execution by a processor to perform the methods' operations and implement the systems described herein.

The computer components, software modules, functions, data stores and data structures described herein may be connected directly or indirectly to each other in order to allow the flow of data needed for their operations. It is also noted that a module or processor includes but is not limited to a unit of code that performs a software operation, and can be implemented for example as a subroutine unit of code, or as a software function unit of code, or as an object (as in an object-oriented paradigm), or as an applet, or in a computer script language, or as another type of computer code. The software components and/or functionality may be located on a single computer or distributed across multiple computers depending upon the situation at hand.

The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

While this specification contains many specifics, these should not be construed as limitations on the scope of what may be claimed, but rather as descriptions of features specific to particular embodiments. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Although specific embodiments of the present invention have been described, it will be understood by those of skill in the art that there are other embodiments that are equivalent to the described embodiments. Accordingly, it is to be understood that the invention is not to be limited by the specific illustrated embodiments, but only by the scope of the appended claims.

It is claimed:
1. A processor-implemented method for predicting regimen-related outcomes, the method comprising: accessing a predictive model that determines likelihoods of each of a plurality side effects for each of a plurality of treatment regimens based on one or more single-nucleotide polymorphisms (SNPs) associated with a patient; predicting, using one or more data processors, regimen-related outcomes including side effects using the predictive model including likelihoods for each of the plurality of side effects; providing a first interface for receiving indications of patient tolerances for side effects, wherein a numerical value is assigned to each of the plurality of side effects based on the received indications; and providing a second interface that identifies the likelihoods for each of the plurality of side effects, wherein a treatment regimen for the patient is determined based on the likelihoods for each of the plurality of side effects and the numerical values assigned for each of the patient tolerances for side effects.
2. The method of claim 1, further comprising: generating, using the one or more data processors, one or more training datasets and one or more testing datasets based at least in part on clinical data and gene feature data of a plurality of patients, the gene feature data including data related to one or more single-nucleotide polymorphisms (SNPs), and the clinical data; determining, using one or more data processors, one or more initial predictive models using one or more machine learning algorithms based at least in part on the one or more training datasets; applying, using the one or more data processors, the one or more initial predictive models on the one or more training datasets to generate result data; performing, using the one or more data processors, an ensemble algorithm on the result data to generate ensemble data; determining, using the one or more data processors, one or more final predictive models based at least in part on the ensemble data; evaluating, using the one or more data processors, performance of the one or more final predictive models based at least in part on the one or more test datasets.
3. The method of claim 2, wherein generating one or more training datasets and one or more testing datasets based at least in part on clinical data and gene feature data of a plurality of patients includes: determining a plurality of SNPs; filtering the plurality of SNPs to determine one or more filtered SNPs; determining the gene feature data based at least in part on the one or more filtered SNPs.
4. The method of claim 3, wherein filtering the plurality of SNPs using a recursive partitioning operation for filtering by: dividing the gene feature dataset related to the plurality of SNPs into a plurality of sub-datasets; selecting one or more first sub-datasets from the plurality of sub-datasets; developing a first recursive partitioning model based at least in part on the one or more first sub-datasets; determining one or more first predictive SNPs based at least in part on the first recursive partitioning model, wherein the one or more first predictive SNPs are included into the one or more filtered SNPs; selecting one or more second sub-datasets from the plurality of sub-datasets; developing a second recursive partitioning model based at least in part on the one or more second sub-datasets; and determining one or more second predictive SNPs based at least in part on the second recursive partitioning model, wherein the one or more second predictive SNPs are included into the one or more filtered SNPs; and
5. The method of claim 3, wherein generating one or more training datasets and one or more testing datasets based at least in part on clinical data or gene feature data of a plurality of patients includes: determining the gene feature data based at least in part on one or more predetermined SNPs.
6. The method of claim 3, wherein filtering the plurality of SNPs to determine the one or more filtered SNPs includes: removing a number of SNPs based on missing data from the plurality of SNPs.
7. The method of claim 3, wherein filtering the plurality of SNPs to determine the one or more filtered SNPs includes: removing one or more SNPs that are associated from the plurality of SNPs.
8. The method of claim 2, wherein the one or more machine learning algorithms correspond to one or more of the following: a penalized logistic regression algorithm, a random forests algorithm, and a C5.0 algorithm.
9. The method of claim 2, wherein generating one or more training datasets and one or more testing datasets based at least in part on clinical data or gene feature data of a plurality of patients includes: generating one or more clinical predictor datasets based at least in part on the clinical data; and generating one or more gene feature datasets based at least in part on the gene feature data.
10. The method of claim 9, wherein applying the one or more initial predictive models on the one or more training datasets to generate result data includes: applying the initial predictive models on the one or more clinical predictor datasets to generate clinical result data; and applying the initial predictive models on the one or more gene feature datasets to generate gene feature result data.
11. The method of claim 2, wherein the ensemble algorithm corresponds to an average calculation or a logistic regression algorithm.
12. The method of claim 2, wherein generating one or more training datasets and one or more testing datasets based at least in part on clinical data or gene feature data of a plurality of patients includes: generating one or more clinical predictor datasets by generating binary predictor data based at least in part on the clinical data.
13. The method of claim 2, further comprising: performing 10-fold cross-validation on the one or more training datasets to determine one or more tuning parameters for the initial predictive models.
14. The method of claim 1, wherein the first interface displays a scale running from perfect health to death.
15. The method of claim 11, wherein low magnitude numerical values are assigned to less tolerable side effects, where a magnitude of zero is associated with death.
16. The method of claim 11, wherein the second interface further displays a scale running from perfect health to death that includes the plurality of side effects, the plurality of side effects being positioned on the scale according to the numerical values assigned to each of the plurality of side effects.
17. A computer-implemented system for predicting regimen-related outcomes, comprising: one or more data processors; one or more non-transitory computer-readable mediums encoded with instructions for commanding the one or more data processors to execute steps of a method that includes: accessing a predictive model that determines likelihoods of each of a plurality side effects for each of a plurality of treatment regimens based on one or more single-nucleotide polymorphisms (SNPs) associated with a patient; predicting, using one or more data processors, regimen-related outcomes including side effects using the predictive model including likelihoods for each of the plurality of side effects; providing a first interface for receiving indications of patient tolerances for side effects, wherein a numerical value is assigned to each of the plurality of side effects based on the received indications; and providing a second interface that identifies the likelihoods for each of the plurality of side effects, wherein a treatment regimen for the patient is determined based on the likelihoods for each of the plurality of side effects and the numerical values assigned for each of the patient tolerances for side effects.
18. The system of claim 17, wherein the method further comprises: generating, using the one or more data processors, one or more training datasets and one or more testing datasets based at least in part on clinical data and gene feature data of a plurality of patients, the gene feature data including data related to one or more single-nucleotide polymorphisms (SNPs), and the clinical data; determining, using one or more data processors, one or more initial predictive models using one or more machine learning algorithms based at least in part on the one or more training datasets; applying, using the one or more data processors, the one or more initial predictive models on the one or more training datasets to generate result data; performing, using the one or more data processors, an ensemble algorithm on the result data to generate ensemble data; determining, using the one or more data processors, one or more final predictive models based at least in part on the ensemble data; evaluating, using the one or more data processors, performance of the one or more final predictive models based at least in part on the one or more test datasets.
19. The system of claim 18, wherein generating one or more training datasets and one or more testing datasets based at least in part on clinical data and gene feature data of a plurality of patients includes: determining a plurality of SNPs; filtering the plurality of SNPs to determine one or more filtered SNPs; determining the gene feature data based at least in part on the one or more filtered SNPs.
20. A non-transitory computer-readable medium encoded with instructions for commanding one or more data processors to execute steps of a method for predicting regimen-related outcomes, the method comprising: accessing a predictive model that determines likelihoods of each of a plurality side effects for each of a plurality of treatment regimens based on one or more single-nucleotide polymorphisms (SNPs) associated with a patient; predicting, using one or more data processors, regimen-related outcomes including side effects using the predictive model including likelihoods for each of the plurality of side effects; providing a first interface for receiving indications of patient tolerances for side effects, wherein a numerical value is assigned to each of the plurality of side effects based on the received indications; and providing a second interface that identifies the likelihoods for each of the plurality of side effects, wherein a treatment regimen for the patient is determined based on the likelihoods for each of the plurality of side effects and the numerical values assigned for each of the patient tolerances for side effects.